ggershinsky commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-669725393


   > extend ParquetWriteSupport which converts crypto setting in schema..
   
   @shangxinli your previous pull request (already merged) added a very nice Encryption Factory interface that allows transparent activation of encryption via existing parameters (such as the schema and the Hadoop configuration), without any modifications in the frameworks, including their Parquet glue code (write support). The current pull request breaks this model: it introduces new configuration channels and requires changes in the frameworks on top. These changes can also create the misconception that schema config changes are the way to activate Parquet encryption, while in fact the frameworks can activate it via, e.g., the existing Hadoop config (with any proper implementation of EncryptionPropertiesFactory, such as PropertiesDrivenCryptoFactory). For example, Spark just needs to update its Parquet version; no other changes are needed. Lastly, this PR might break future interoperability with C++-based frameworks (such as pyarrow and pandas): we're adding high-level encryption support there, based on the standard EncryptionPropertiesFactory model.
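   To make the "existing parameters only" point concrete: with a properties-driven factory, a framework can activate encryption purely through the Hadoop configuration it already passes to Parquet. A minimal sketch of what such a configuration might contain, modeled here as a plain map so the sketch stays self-contained (the property names follow PropertiesDrivenCryptoFactory conventions; treat the values as illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: the crypto settings a framework would place into its
// existing Hadoop Configuration. No framework code changes are needed beyond
// setting these <String,String> entries.
public class EncryptionConfigSketch {

    public static Map<String, String> encryptionSettings() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Plug in the factory implementation.
        conf.put("parquet.crypto.factory.class",
                 "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");
        // File-wide parameters, such as the encryption algorithm.
        conf.put("parquet.encryption.algorithm", "AES_GCM_V1");
        // Per-column parameters ride in the same <String,String> channel.
        conf.put("parquet.encryption.footer.key", "footerKeyId");
        conf.put("parquet.encryption.column.keys", "keyId1:ssn,credit_card");
        return conf;
    }

    public static void main(String[] args) {
        encryptionSettings().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

   Because everything travels through the existing configuration channel, a framework like Spark picks this up with no changes to its Parquet glue code.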
   
   As mentioned above, this pull request does not implement schema-activated encryption. The encryption here is activated by a collection of custom parameter maps, piggy-backed on the schema elements. Since such custom maps don't exist today, this PR adds them.
   
   Fortunately, there is a simple solution that lets you support your organization's requirements without adding such maps and without breaking the Crypto Factory model. Currently, you pass your crypto parameters in custom <String,String> maps. But the Hadoop config is also a <String,String> map. You already use it for file-wide parameters (such as the encryption algorithm). You can also use it for column parameters, by adding them with column names, or simply by concatenating them into a single string parameter. See HIVE-21848 for an example (or PropertiesDrivenCryptoFactory, which implements HIVE-21848).
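   To illustrate the concatenation approach: per-column settings can be flattened into one string value and parsed back on the factory side. A small sketch (the "keyId:col1,col2;keyId2:col3" format mirrors the one used by PropertiesDrivenCryptoFactory; the helper names are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helpers showing how per-column crypto settings can ride in a
// single <String,String> Hadoop config entry instead of custom schema maps.
public class ColumnKeysParam {

    // Flatten {keyId -> columns} into "keyId1:col1,col2;keyId2:col3".
    public static String serialize(Map<String, List<String>> columnKeys) {
        return columnKeys.entrySet().stream()
                .map(e -> e.getKey() + ":" + String.join(",", e.getValue()))
                .collect(Collectors.joining(";"));
    }

    // Parse the single string back into {keyId -> columns} on the factory side.
    public static Map<String, List<String>> parse(String param) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String entry : param.split(";")) {
            String[] parts = entry.split(":", 2);
            result.put(parts[0], List.of(parts[1].split(",")));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> keys = new LinkedHashMap<>();
        keys.put("keyId1", List.of("ssn", "credit_card"));
        keys.put("keyId2", List.of("email"));
        String flat = serialize(keys);
        System.out.println(flat); // keyId1:ssn,credit_card;keyId2:email
        System.out.println(parse(flat).equals(keys)); // true
    }
}
```

   The whole column-key layout then fits in one ordinary config entry, with no new configuration channel.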
   
   > The way it works is SchemaCryptoPropertiesFactory should have RPC call 
with KMS ...
   > We don't want to release helper function like setKey() etc because that 
will require code changes in the existing pipelines. 
   
   You might go a step further and use PropertiesDrivenCryptoFactory directly in your deployment. This factory is designed for efficient KMS support, keeping RPC calls to a minimum. It sets keys and metadata without requiring code changes in the existing pipelines. It also implements cryptography best practices, e.g. preventing mistakes like using the same data key for many files. Moreover, by using it, you will tap into community support and future improvements in this area.
   That said, this is only a recommendation; crypto factories are pluggable and can be swapped with any custom implementation, as long as it follows the defined model.
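   To make the pluggability concrete, a custom factory only needs to honor the factory contract: derive the file's encryption properties from the configuration it receives. A skeleton, using a simplified stand-in interface so the sketch compiles without Parquet on the classpath (the real org.apache.parquet.crypto.EncryptionPropertiesFactory also receives the file path and write context):

```java
import java.util.Map;

// Simplified stand-in for the EncryptionPropertiesFactory contract, so this
// sketch stays self-contained. Maps play the role of the Hadoop Configuration
// and of the returned encryption properties.
interface CryptoFactorySketch {
    Map<String, String> getFileEncryptionProperties(Map<String, String> conf);
}

// A custom implementation can resolve keys however the deployment requires
// (e.g. an RPC to a KMS), as long as it follows the factory model.
public class CustomCryptoFactory implements CryptoFactorySketch {
    @Override
    public Map<String, String> getFileEncryptionProperties(Map<String, String> conf) {
        // Hypothetical logic: read deployment-specific settings from the
        // config and resolve them into per-file encryption properties.
        String footerKeyId =
                conf.getOrDefault("parquet.encryption.footer.key", "default-key");
        return Map.of("footer.key.id", footerKeyId);
    }

    public static void main(String[] args) {
        Map<String, String> props = new CustomCryptoFactory()
                .getFileEncryptionProperties(
                        Map.of("parquet.encryption.footer.key", "k1"));
        System.out.println(props); // {footer.key.id=k1}
    }
}
```

   Swapping factories is then just a matter of pointing the factory-class config entry at the custom implementation.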


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

