ggershinsky commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-671228812


   @shangxinli thanks again for the additional details, it was interesting to have a glance at your use case. It is indeed somewhat different from ours. Your existing pipelines seem to be managed by a centralized schema service that can add custom properties to, e.g., Avro schemas. That's how the data can be encrypted without (many) changes in the framework pipelines, making encryption invisible to the user. Some modifications are required in the framework parquet WriteSupport classes for this to work.
   Our approach also allows for transparent pipeline encryption, but via a different centralized metadata service - the Hive MetaStore (and without changes in parquet WriteSupport):
   https://cloud.ibm.com/docs/AnalyticsEngine?topic=AnalyticsEngine-parquet-encryption-on-hive-tables
   We can also let the user configure the encryption directly:
   https://cloud.ibm.com/docs/AnalyticsEngine?topic=AnalyticsEngine-parquet-encryption
   We're working on a general encryption policy mechanism, to be contributed to the framework open-source repo(s); after seeing your comment, I think we need to work together to make sure your use case is supported there too.
   
   Anyway, going back to the Parquet repo and PR #808. From a technical point of view, I believe the situation is quite clear. You work with two plug-ins: WriteSupport and the crypto factory. In the WriteSupport init function, you get the custom parameter maps from the Avro schema, re-create the write context object, and copy the maps into ExtType custom maps. The write context is then passed to the crypto factory, where you extract the custom maps and create the encryption properties.
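   For concreteness, here is roughly what the factory side of that hand-off looks like - a minimal sketch of an EncryptionPropertiesFactory implementation. The "sample.footer.key" property name and the base64 key encoding are made up for illustration:

```java
import java.util.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.crypto.EncryptionPropertiesFactory;
import org.apache.parquet.crypto.FileEncryptionProperties;
import org.apache.parquet.crypto.ParquetCryptoRuntimeException;
import org.apache.parquet.hadoop.api.WriteSupport.WriteContext;

// Sketch of a crypto factory plug-in. parquet-mr loads the implementation
// named by the "parquet.crypto.factory.class" property and calls it once per
// file, passing the worker-local Configuration and the WriteContext that
// WriteSupport.init() returned - the two channels discussed below.
public class SampleCryptoFactory implements EncryptionPropertiesFactory {

  @Override
  public FileEncryptionProperties getFileEncryptionProperties(
      Configuration fileHadoopConfig, Path tempFilePath, WriteContext fileWriteContext)
      throws ParquetCryptoRuntimeException {
    // "sample.footer.key" is a hypothetical property name for this sketch.
    String footerKeyBase64 = fileHadoopConfig.get("sample.footer.key");
    if (footerKeyBase64 == null) {
      return null; // no key configured for this file: write it in plaintext
    }
    byte[] footerKey = Base64.getDecoder().decode(footerKeyBase64); // 16/24/32 bytes
    return FileEncryptionProperties.builder(footerKey).build();
  }
}
```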
   
   But you have other options in the same WriteSupport init function. One of them is to use the Hadoop configuration. It is no longer global, but rather a local copy: an object created for the given worker. It still carries the global properties, and you can add your crypto properties to it; they will be carried to the crypto factory, as sketched below.
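   A minimal sketch of this first option, written as a decorator around an existing WriteSupport; the "encrypt.columns" Avro schema property and the "sample.column.list" configuration key are hypothetical:

```java
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.RecordConsumer;

// Decorator that injects crypto properties into the worker-local
// Configuration copy before delegating to the real WriteSupport.
public class CryptoConfigWriteSupport<T> extends WriteSupport<T> {

  private final WriteSupport<T> delegate;
  private final Schema avroSchema;

  public CryptoConfigWriteSupport(WriteSupport<T> delegate, Schema avroSchema) {
    this.delegate = delegate;
    this.avroSchema = avroSchema;
  }

  @Override
  public WriteContext init(Configuration configuration) {
    // Hypothetical schema property carrying the column encryption list.
    String encryptedColumns = avroSchema.getProp("encrypt.columns");
    if (encryptedColumns != null) {
      // Safe to modify: this Configuration is a local copy for this worker,
      // and the same object is later passed to the crypto factory.
      configuration.set("sample.column.list", encryptedColumns);
    }
    return delegate.init(configuration);
  }

  @Override
  public void prepareForWrite(RecordConsumer recordConsumer) {
    delegate.prepareForWrite(recordConsumer);
  }

  @Override
  public void write(T record) {
    delegate.write(record);
  }
}
```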
   Another option is to use the "extraMetadata" field in the WriteContext, an additional existing channel for passing custom properties. Currently, you pass an empty Map there, but you could pass your properties instead; this field is accessible in the crypto factories (see the second sketch below). So, in the existing parquet WriteSupport init function, you have two alternatives today that allow you to pass the Avro schema metadata required by your use case, without adding a third custom channel and without changing the crypto factory model.
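   A sketch of the second option, as a variant init() for the same decorator class from the previous sketch (property names again hypothetical). One caveat, if I read the write path correctly: the extraMetaData map is also persisted in the file footer's key-value metadata, so only non-sensitive information should travel through this channel.

```java
// Variant of CryptoConfigWriteSupport.init() from the sketch above;
// additionally needs java.util.HashMap and java.util.Map imports.
@Override
public WriteContext init(Configuration configuration) {
  WriteContext context = delegate.init(configuration);
  // WriteContext.getExtraMetaData() is unmodifiable, so copy it first.
  Map<String, String> extraMetaData = new HashMap<>(context.getExtraMetaData());
  String encryptedColumns = avroSchema.getProp("encrypt.columns"); // hypothetical
  if (encryptedColumns != null) {
    extraMetaData.put("encrypt.columns", encryptedColumns);
  }
  // Re-create the WriteContext with the enriched map; the crypto factory
  // can read it via fileWriteContext.getExtraMetaData().
  return new WriteContext(context.getSchema(), extraMetaData);
}
```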

