adamreeve commented on issue #47435:
URL: https://github.com/apache/arrow/issues/47435#issuecomment-3240469675

   Above you describe your use case as:
   
   > * A parquet file is encrypted with a DEK.
   > * The DEK is encrypted with KEK.
   > * The encrypted DEK and related metadata (i.e., KMS info to get KEK) is 
printed as JSON in a metadata file.
   > * The parquet file and metadata file are uploaded to a cloud location.
   
   This is exactly what the higher-level API does for you if you set [`internal_key_material`](https://github.com/apache/arrow/blob/ed77d25149569eb9a48f61f3694fd8ea4b9a411d/cpp/src/parquet/encryption/crypto_factory.h#L77-L81) to false (note that it also uses double wrapping by default, but you can disable that too). So I would suggest looking more closely at the high-level API to see if it can work for you, rather than re-implementing this yourself.
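
For reference, here's a rough sketch of that flow through the PyArrow high-level API, assuming external key material is supported the same way through the Python bindings as in C++. The toy KMS client and the key identifiers are placeholders you would replace with your real KMS integration:

```python
import base64

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe


class ToyKmsClient(pe.KmsClient):
    """Insecure stand-in for a real KMS client: 'wraps' keys with base64.

    A real implementation would call out to your KMS to encrypt/decrypt
    data encryption keys with the master key.
    """

    def __init__(self, kms_connection_config):
        pe.KmsClient.__init__(self)
        self._config = kms_connection_config

    def wrap_key(self, key_bytes, master_key_identifier):
        # Placeholder: a real KMS would encrypt key_bytes with the master key.
        return base64.b64encode(
            master_key_identifier.encode("utf-8") + b":" + key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        key_id, _, key = base64.b64decode(wrapped_key).partition(b":")
        if key_id != master_key_identifier.encode("utf-8"):
            raise ValueError("Wrong master key")
        return key


crypto_factory = pe.CryptoFactory(lambda config: ToyKmsClient(config))
kms_config = pe.KmsConnectionConfig()  # KMS endpoint/credentials would go here

encryption_config = pe.EncryptionConfiguration(
    footer_key="footer-key-id",           # placeholder master key identifiers
    column_keys={"column-key-id": ["a"]},
    double_wrapping=False,        # wrap DEKs directly with the master key
    internal_key_material=False,  # key material goes to a separate JSON file
)

encryption_properties = crypto_factory.file_encryption_properties(
    kms_config, encryption_config)

table = pa.table({"a": [1, 2, 3]})
with pq.ParquetWriter("data.parquet", table.schema,
                      encryption_properties=encryption_properties) as writer:
    writer.write_table(table)
```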
    
   That said, I think it would be OK to expose the lower-level API in Python for use cases where users want more control over how keys are managed, as long as the downsides are clearly documented and users are directed to the higher-level API by default (the C++ docs/examples could probably be improved here).
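
To make "the lower-level API" concrete: in C++ this means building `parquet::FileEncryptionProperties` directly, supplying the DEKs yourself. A purely hypothetical sketch of what a Python binding for that could look like (none of these Python names exist in PyArrow today):

```python
# HYPOTHETICAL: nothing below exists in PyArrow today. It loosely mirrors
# the C++ parquet::FileEncryptionProperties::Builder to illustrate what a
# direct-key (lower-level) Python binding might look like.
import pyarrow.parquet as pq

footer_dek = b"0123456789abcdef"  # 16-byte AES keys managed by the application
column_dek = b"fedcba9876543210"

encryption_properties = (
    pq.FileEncryptionProperties.builder(footer_dek)           # hypothetical
      .footer_key_metadata(b"ref-to-wrapped-footer-dek")      # app-defined pointer
      .column_key("sensitive_col", column_dek,
                  key_metadata=b"ref-to-wrapped-column-dek")  # hypothetical
      .build()
)
# The application would then wrap the DEKs with its KEK and store the wrapped
# keys plus KMS info itself (e.g. the JSON metadata file described above).
```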
   
   I believe PyArrow is the only Parquet implementation that supports encryption but doesn't expose the lower-level API, so withholding it arguably hinders compatibility rather than helping it. For example, I'm not super familiar with parquet-java, but I think you can set `spark.hadoop.parquet.crypto.factory.class` to your own class that implements `EncryptionPropertiesFactory` and `DecryptionPropertiesFactory`, instead of having to use the `PropertiesDrivenCryptoFactory`.
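
For example (untested; the factory class name is a placeholder), the Spark side would look something like:

```python
from pyspark.sql import SparkSession

# "com.example.MyCryptoFactory" is a placeholder for your own JVM class
# implementing EncryptionPropertiesFactory and DecryptionPropertiesFactory.
spark = (
    SparkSession.builder
    .config("spark.hadoop.parquet.crypto.factory.class",
            "com.example.MyCryptoFactory")
    .getOrCreate()
)

# parquet-java invokes the configured factory when writing/reading Parquet.
spark.range(10).write.parquet("/tmp/encrypted-output")
```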

