ggershinsky commented on pull request #808: URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-671228812
@shangxinli thanks again for the additional details; it was interesting to have a glance at your use case. It is indeed somewhat different from ours. Your existing pipelines seem to be managed by a centralized schema service that can add custom properties to e.g. Avro schemas. That's how the data can be encrypted without (many) changes in the framework pipelines, making encryption invisible to the user. Some modifications are required in the framework parquet WriteSupport classes for this to work.

Our approach also allows for transparent pipeline encryption, but via a different centralized metadata service, the Hive MetaStore, and without changes in parquet WriteSupport: https://cloud.ibm.com/docs/AnalyticsEngine?topic=AnalyticsEngine-parquet-encryption-on-hive-tables We can also let the user configure the encryption directly: https://cloud.ibm.com/docs/AnalyticsEngine?topic=AnalyticsEngine-parquet-encryption We're working on a general encryption policy mechanism, to be contributed to framework open source repo(s), and, after seeing your comment, I think we need to work together to make sure your use case will be supported there too.

Anyway, going back to the Parquet repo and PR #808. From the technical point of view, I believe the situation is quite clear. You work with two plug-ins: WriteSupport and the crypto factory. In the WriteSupport init function, you get the custom parameter maps from the Avro schema, re-create the write context object, and copy the maps into ExtType custom maps. The write context is then passed to the crypto factory, where you extract the custom maps and create the encryption properties.

But you have other options in the same WriteSupport init function. One of them is to use the Hadoop configuration. It is not global anymore, but rather a local copy, an object created for the given worker. It still carries the global properties, and you can add your crypto properties here; they will be carried to the crypto factory.
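The Hadoop configuration alternative can be sketched roughly as below. This is a self-contained illustration rather than real parquet-mr code: the `Configuration` class is a minimal stand-in for `org.apache.hadoop.conf.Configuration`, and the `parquet.crypto.custom.` property prefix and the helper methods are hypothetical. The only point is that whatever the WriteSupport `init` writes into the worker-local configuration copy is readable later in the crypto factory (which receives the same configuration in `getFileEncryptionProperties`).

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of option 1: pass custom crypto metadata through the worker-local
// Hadoop configuration. "Configuration" is a stand-in for
// org.apache.hadoop.conf.Configuration; property names are illustrative.
public class ConfigChannelSketch {

    // Stand-in for the per-worker Configuration copy (not a global object).
    public static class Configuration extends HashMap<String, String> {}

    // What a WriteSupport.init(Configuration) override would do: copy the
    // custom properties found in the Avro schema into the configuration.
    public static void init(Configuration conf, Map<String, String> avroSchemaProps) {
        for (Map.Entry<String, String> e : avroSchemaProps.entrySet()) {
            conf.put("parquet.crypto.custom." + e.getKey(), e.getValue());
        }
    }

    // What the crypto factory could do with the configuration it receives:
    // read the same properties back and build encryption properties from them.
    public static String readBack(Configuration conf, String key) {
        return conf.get("parquet.crypto.custom." + key);
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Map<String, String> avroProps = new HashMap<>();
        avroProps.put("encrypt.columns", "ssn,credit_card");
        init(conf, avroProps);
        // prints ssn,credit_card
        System.out.println(readBack(conf, "encrypt.columns"));
    }
}
```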
Another option is to use the "extraMetadata" field in the WriteContext, an additional existing channel for passing custom properties. Currently, you pass an empty Map there, but you could pass your properties instead; this field is accessible in the crypto factories.

So, in the existing parquet WriteSupport init function, you have two alternatives today that allow you to pass the Avro schema metadata, as required in your use case, without adding a third custom channel and without changing the crypto factory model.
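The extraMetadata alternative could look roughly like this. Again a self-contained sketch, not real parquet-mr code: the `WriteContext` class below models only the extra-metadata map of `org.apache.parquet.hadoop.api.WriteSupport.WriteContext` (the schema argument is omitted, and the accessor is named after parquet-mr's `getExtraMetaData()`), and the property names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of option 2: carry the custom properties in the WriteContext's
// extra-metadata map instead of an empty map. This WriteContext is a
// minimal stand-in for org.apache.parquet.hadoop.api.WriteSupport.WriteContext
// (schema omitted); property names are illustrative.
public class ExtraMetadataSketch {

    public static class WriteContext {
        private final Map<String, String> extraMetaData;

        public WriteContext(Map<String, String> extraMetaData) {
            this.extraMetaData = extraMetaData;
        }

        public Map<String, String> getExtraMetaData() {
            return extraMetaData;
        }
    }

    // What a WriteSupport.init would do: build the write context with the
    // schema-derived crypto properties instead of an empty map.
    public static WriteContext buildContext(Map<String, String> avroSchemaProps) {
        return new WriteContext(new HashMap<>(avroSchemaProps));
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("encrypt.columns", "ssn,credit_card");
        WriteContext ctx = buildContext(props);

        // In the crypto factory, the same map is reachable from the
        // WriteContext argument of getFileEncryptionProperties(...).
        // prints ssn,credit_card
        System.out.println(ctx.getExtraMetaData().get("encrypt.columns"));
    }
}
```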
