adamreeve commented on issue #47435: URL: https://github.com/apache/arrow/issues/47435#issuecomment-3240469675
Above you describe your use case as:

> * A parquet file is encrypted with a DEK.
> * The DEK is encrypted with KEK.
> * The encrypted DEK and related metadata (i.e., KMS info to get KEK) is printed as JSON in a metadata file.
> * The parquet file and metadata file are uploaded to a cloud location.

This is exactly what the higher-level API does for you if you set [`internal_key_material`](https://github.com/apache/arrow/blob/ed77d25149569eb9a48f61f3694fd8ea4b9a411d/cpp/src/parquet/encryption/crypto_factory.h#L77-L81) to false (it also uses double wrapping by default, but you can disable that too). So I would suggest looking more closely at the high-level API to see whether it can work for you, rather than re-implementing this yourself; there's a sketch of what I mean at the end of this comment.

That said, I think it would be OK to expose the lower-level API in Python for use cases where users want more control over how keys are managed, as long as the downsides are clearly documented and users are directed to the higher-level API by default (the C++ docs/examples could probably be improved here).

I believe PyArrow is the only Parquet implementation that supports encryption but doesn't expose the lower-level API, so this omission would seem to actually hinder compatibility. For example, I'm not super familiar with parquet-java, but I think you can set `spark.hadoop.parquet.crypto.factory.class` to your own class that implements `EncryptionPropertiesFactory` and `DecryptionPropertiesFactory`, instead of having to use the `PropertiesDrivenCryptoFactory`.
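To make the high-level API suggestion concrete, here is a minimal, untested sketch using `pyarrow.parquet.encryption`, written from memory. `ToyKmsClient` and the key names are mine, and its wrap/unwrap is deliberately insecure (it just prefixes the DEK with the KEK); a real client would call out to your KMS. One caveat I haven't verified: with external key material, the C++ factory also needs the Parquet file path and filesystem so it can write the key material file, and I'm not sure how much of that is plumbed through in Python yet, so treat this as illustrative:

```python
import base64

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe


class ToyKmsClient(pe.KmsClient):
    """Toy stand-in for a real KMS client (insecure, illustration only)."""

    def __init__(self, kms_connection_config):
        pe.KmsClient.__init__(self)
        # KEKs are held locally here; a real client would call out to a KMS.
        self._keks = kms_connection_config.custom_kms_conf

    def wrap_key(self, key_bytes, master_key_identifier):
        # "Encrypt" the DEK by prefixing it with the KEK (do NOT do this for real).
        kek = self._keks[master_key_identifier].encode("utf-8")
        return base64.b64encode(kek + key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        # Reverse of wrap_key: strip the KEK prefix to recover the DEK.
        kek = self._keks[master_key_identifier].encode("utf-8")
        return base64.b64decode(wrapped_key)[len(kek):]


# The factory callable is invoked with the KmsConnectionConfig.
crypto_factory = pe.CryptoFactory(ToyKmsClient)
kms_connection_config = pe.KmsConnectionConfig(
    custom_kms_conf={"kek1": "0123456789012345"})

encryption_config = pe.EncryptionConfiguration(
    footer_key="kek1",
    column_keys={"kek1": ["secret"]},
    double_wrapping=False,        # wrap DEKs directly with the master key (your KEK)
    internal_key_material=False,  # store key material in a JSON file beside the Parquet file
)

table = pa.table({"id": [1, 2], "secret": ["a", "b"]})
encryption_properties = crypto_factory.file_encryption_properties(
    kms_connection_config, encryption_config)
with pq.ParquetWriter("data.parquet", table.schema,
                      encryption_properties=encryption_properties) as writer:
    writer.write_table(table)

decryption_properties = crypto_factory.file_decryption_properties(
    kms_connection_config)
restored = pq.ParquetFile(
    "data.parquet", decryption_properties=decryption_properties).read()
```

If this works as I expect, the write should produce a small JSON key material file next to `data.parquet` (named something like `_KEY_MATERIAL_FOR_data.parquet.json`, if I remember right), which is essentially the metadata file from your list above, and both files can then be uploaded together.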
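And on the parquet-java/Spark side, the wiring I have in mind is just a Hadoop config entry pointing at your own factory class; `com.example.MyCryptoFactory` below is a placeholder:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Placeholder class name; it would implement both EncryptionPropertiesFactory
    # and DecryptionPropertiesFactory from parquet-java.
    .config("spark.hadoop.parquet.crypto.factory.class",
            "com.example.MyCryptoFactory")
    .getOrCreate()
)

# Parquet writes then get their encryption properties from that factory.
spark.range(10).write.parquet("/tmp/encrypted")
```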