ggershinsky commented on pull request #808: URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-671228812
@shangxinli thanks again for the additional details; it was interesting to have a glance at your use case. It is indeed somewhat different from ours. Your existing pipelines seem to be managed by a centralized schema service that can add custom properties to e.g. Avro schemas. That's how the data can be encrypted without (many) changes in the framework pipelines, making encryption invisible to the user. Some modifications are required in the framework parquet WriteSupport classes for this to work.

Our approach also allows for transparent pipeline encryption, but via a different centralized metadata service, the Hive MetaStore, and without changes in parquet WriteSupport: https://cloud.ibm.com/docs/AnalyticsEngine?topic=AnalyticsEngine-parquet-encryption-on-hive-tables We can also let the user configure the encryption directly: https://cloud.ibm.com/docs/AnalyticsEngine?topic=AnalyticsEngine-parquet-encryption We're working on a general encryption policy mechanism, to be contributed to framework open source repo(s), and, after seeing your comment, I think we need to work together to make sure your use case will be supported there too.

Anyway, going back to the Parquet repo and PR #808. From the technical point of view, I believe the situation is quite clear. You work with two plug-ins: WriteSupport and the crypto factory. In the WriteSupport init function, you get the custom parameter maps from the Avro schema, re-create the write context object, and copy the maps into ExtType custom maps. The write context is then passed to the crypto factory, where you extract the custom maps and create the encryption properties.

But you have other options in the same WriteSupport init function. One of them is to use the Hadoop configuration. It is not global anymore, but rather a local copy, an object created for the given worker. It still carries the global properties, and you can add your crypto properties here; they will be carried to the crypto factory.
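The Hadoop configuration alternative can be sketched roughly as below. This is a self-contained illustration rather than real parquet-mr code: the `Configuration` class is a minimal stand-in for `org.apache.hadoop.conf.Configuration`, and the `parquet.crypto.custom.` property prefix and the helper methods are hypothetical. The only point is that whatever the WriteSupport `init` writes into the worker-local configuration copy is readable later in the crypto factory (which receives the same configuration in `getFileEncryptionProperties`).

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of option 1: pass custom crypto metadata through the worker-local
// Hadoop configuration. "Configuration" is a stand-in for
// org.apache.hadoop.conf.Configuration; property names are illustrative.
public class ConfigChannelSketch {

    // Stand-in for the per-worker Configuration copy (not a global object).
    public static class Configuration extends HashMap<String, String> {}

    // What a WriteSupport.init(Configuration) override would do: copy the
    // custom properties found in the Avro schema into the configuration.
    public static void init(Configuration conf, Map<String, String> avroSchemaProps) {
        for (Map.Entry<String, String> e : avroSchemaProps.entrySet()) {
            conf.put("parquet.crypto.custom." + e.getKey(), e.getValue());
        }
    }

    // What the crypto factory could do with the configuration it receives:
    // read the same properties back and build encryption properties from them.
    public static String readBack(Configuration conf, String key) {
        return conf.get("parquet.crypto.custom." + key);
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Map<String, String> avroProps = new HashMap<>();
        avroProps.put("encrypt.columns", "ssn,credit_card");
        init(conf, avroProps);
        // prints ssn,credit_card
        System.out.println(readBack(conf, "encrypt.columns"));
    }
}
```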
Another option is to use the "extraMetadata" field in the WriteContext, an additional existing channel for passing custom properties. Currently, you pass an empty Map there, but you could pass your properties instead; this field is accessible in the crypto factories.

So, in the existing parquet WriteSupport init function, you have two alternatives today that allow you to pass the Avro schema metadata, as required in your use case, without adding a third custom channel and without changing the crypto factory model.
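The extraMetadata alternative could look roughly like this. Again a self-contained sketch, not real parquet-mr code: the `WriteContext` class below models only the extra-metadata map of `org.apache.parquet.hadoop.api.WriteSupport.WriteContext` (the schema argument is omitted, and the accessor is named after parquet-mr's `getExtraMetaData()`), and the property names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of option 2: carry the custom properties in the WriteContext's
// extra-metadata map instead of an empty map. This WriteContext is a
// minimal stand-in for org.apache.parquet.hadoop.api.WriteSupport.WriteContext
// (schema omitted); property names are illustrative.
public class ExtraMetadataSketch {

    public static class WriteContext {
        private final Map<String, String> extraMetaData;

        public WriteContext(Map<String, String> extraMetaData) {
            this.extraMetaData = extraMetaData;
        }

        public Map<String, String> getExtraMetaData() {
            return extraMetaData;
        }
    }

    // What a WriteSupport.init would do: build the write context with the
    // schema-derived crypto properties instead of an empty map.
    public static WriteContext buildContext(Map<String, String> avroSchemaProps) {
        return new WriteContext(new HashMap<>(avroSchemaProps));
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("encrypt.columns", "ssn,credit_card");
        WriteContext ctx = buildContext(props);

        // In the crypto factory, the same map is reachable from the
        // WriteContext argument of getFileEncryptionProperties(...).
        // prints ssn,credit_card
        System.out.println(ctx.getExtraMetaData().get("encrypt.columns"));
    }
}
```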
