adamreeve commented on code in PR #40329:
URL: https://github.com/apache/arrow/pull/40329#discussion_r1525707563
##########
cpp/src/parquet/encryption/crypto_factory.cc:
##########
@@ -172,8 +201,8 @@ std::shared_ptr<FileDecryptionProperties>
CryptoFactory::GetFileDecryptionProper
const KmsConnectionConfig& kms_connection_config,
const DecryptionConfiguration& decryption_config, const std::string&
file_path,
const std::shared_ptr<::arrow::fs::FileSystem>& file_system) {
- auto key_retriever = std::make_shared<FileKeyUnwrapper>(
- &key_toolkit_, kms_connection_config,
decryption_config.cache_lifetime_seconds,
+ auto key_retriever = std::make_shared<CryptoFactoryFileKeyRetriever>(
Review Comment:
No this isn't related to having access to the CryptoFactory instance, I
think this discussion has gotten a bit sidetracked from the main point of the
PR.
My use case is probably a bit non-standard as I'm trying to provide a .NET
binding to the encryption API without users having to deal with manual memory
management. I have no control over the CryptoFactory lifetime so if users of
ParquetSharp misuse the API they could get a segfault, which isn't usually
expected in a managed language/runtime.
This got me thinking, I wonder how pyarrow handles this. And it turns out
pyarrow has the same issue. This code causes a hang on my machine, but works if
I don't have the `del crypto_factory` line:
```python
import base64
import gc
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe
class TestKmsClient(pe.KmsClient):
""" Dummy KMS client that base64 encodes keys """
def __init__(self, kms_connection_configuration):
pe.KmsClient.__init__(self)
def wrap_key(self, key_bytes, master_key_identifier):
return f"{base64.b64encode(key_bytes).decode('utf-8')}"
def unwrap_key(self, wrapped_key, master_key_identifier):
return base64.b64decode(wrapped_key)
# Create test data
x = pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], type=pa.int32())
y = pa.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
type=pa.float32())
schema = pa.schema([
pa.field("x", pa.int32(), nullable=False),
pa.field("y", pa.float32(), nullable=False),
])
table = pa.table([x, y], schema=schema)
# Write encrypted file
crypto_factory = pe.CryptoFactory(lambda config: TestKmsClient(config))
config = pe.KmsConnectionConfig()
encryption_config = pe.EncryptionConfiguration(
footer_key="Key0",
column_keys={
"Key0": ["x", "y"],
},
double_wrapping=True,
internal_key_material=True
)
encryption_properties = crypto_factory.file_encryption_properties(config,
encryption_config)
with pq.ParquetWriter("./encrypted_py.parquet", table.schema,
encryption_properties=encryption_properties) as writer:
writer.write_table(table)
# Read encrypted file
decryption_config = pe.DecryptionConfiguration()
decryption_properties = crypto_factory.file_decryption_properties(config,
decryption_config)
del crypto_factory
gc.collect()
with pq.ParquetFile("./encrypted_py.parquet",
decryption_properties=decryption_properties) as reader:
read_table = reader.read()
print(read_table)
```
A similar example using a more realistic encrypted file caused a segfault.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]