adamreeve commented on code in PR #40329:
URL: https://github.com/apache/arrow/pull/40329#discussion_r1525707563


##########
cpp/src/parquet/encryption/crypto_factory.cc:
##########
@@ -172,8 +201,8 @@ std::shared_ptr<FileDecryptionProperties> 
CryptoFactory::GetFileDecryptionProper
     const KmsConnectionConfig& kms_connection_config,
     const DecryptionConfiguration& decryption_config, const std::string& 
file_path,
     const std::shared_ptr<::arrow::fs::FileSystem>& file_system) {
-  auto key_retriever = std::make_shared<FileKeyUnwrapper>(
-      &key_toolkit_, kms_connection_config, 
decryption_config.cache_lifetime_seconds,
+  auto key_retriever = std::make_shared<CryptoFactoryFileKeyRetriever>(

Review Comment:
   No this isn't related to having access to the CryptoFactory instance, I 
think this discussion has gotten a bit sidetracked from the main point of the 
PR.
   
   My use case is probably a bit non-standard as I'm trying to provide a .NET 
binding to the encryption API without users having to deal with manual memory 
management. I have no control over the CryptoFactory lifetime so if users of 
ParquetSharp misuse the API they could get a segfault, which isn't usually 
expected in a managed language/runtime.
   
   This got me thinking, I wonder how pyarrow handles this. And it turns out 
pyarrow has the same issue. This code causes a hang on my machine, but works if 
I don't have the `del crypto_factory` line:
   
   ```python
   import base64
   import gc
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.parquet.encryption as pe
   
   
   class TestKmsClient(pe.KmsClient):
       """ Dummy KMS client that base64 encodes keys """
   
       def __init__(self, kms_connection_configuration):
           pe.KmsClient.__init__(self)
   
       def wrap_key(self, key_bytes, master_key_identifier):
           return f"{base64.b64encode(key_bytes).decode('utf-8')}"
   
       def unwrap_key(self, wrapped_key, master_key_identifier):
           return base64.b64decode(wrapped_key)
   
   
   # Create test data
   x = pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], type=pa.int32())
   y = pa.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 
type=pa.float32())
   schema = pa.schema([
       pa.field("x", pa.int32(), nullable=False),
       pa.field("y", pa.float32(), nullable=False),
   ])
   table = pa.table([x, y], schema=schema)
   
   
   # Write encrypted file
   crypto_factory = pe.CryptoFactory(lambda config: TestKmsClient(config))
   config = pe.KmsConnectionConfig()
   encryption_config = pe.EncryptionConfiguration(
      footer_key="Key0",
      column_keys={
         "Key0": ["x", "y"],
      },
      double_wrapping=True,
      internal_key_material=True
   )
   encryption_properties = crypto_factory.file_encryption_properties(config, 
encryption_config)
   
   with pq.ParquetWriter("./encrypted_py.parquet", table.schema,
                        encryption_properties=encryption_properties) as writer:
      writer.write_table(table)
   
   # Read encrypted file
   decryption_config = pe.DecryptionConfiguration()
   decryption_properties = crypto_factory.file_decryption_properties(config, 
decryption_config)
   
   del crypto_factory
   gc.collect()
   
   with pq.ParquetFile("./encrypted_py.parquet",
                        decryption_properties=decryption_properties) as reader:
      read_table = reader.read()
   
   print(read_table)
   ```
   
   A similar example using a more realistic encrypted file caused a segfault.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to