Dear all,

As PARQUET-1178 <https://issues.apache.org/jira/browse/PARQUET-1178> passed the vote, I would like to bring PARQUET-1396 <https://issues.apache.org/jira/browse/PARQUET-1396> (Crypto Interface for Schema Activation of Parquet Encryption <https://docs.google.com/document/d/17GTQAezl1ZC1pMNHjYU_bPVxMU6DIPjtXOiLclXUlyA>) up for a design review. The motivation of PARQUET-1396 is to make PARQUET-1178 easier to integrate into existing applications and services. We will provide a sample implementation of this interface. A summary of PARQUET-1396 is below; for details, please have a look at the design document, Crypto Interface for Schema Activation of Parquet Encryption <https://docs.google.com/document/d/17GTQAezl1ZC1pMNHjYU_bPVxMU6DIPjtXOiLclXUlyA>.
Problem statement

PARQUET-1178 provides column encryption inside the Parquet file, but to adopt it, Parquet applications like Spark/Hive need to change their code by 1) calling the new crypto API; 2) determining column sensitivity and building the corresponding column encryption properties; and 3) handling key management, which most likely involves interacting with a remote KMS (Key Management Service). In reality, especially in a large organization, the Parquet library is used by many different applications and teams, so making a relatively significant code change to every application could make adoption of the encryption feature harder and slower. In this situation, people usually prefer minimal changes, such as upgrading the Parquet library and enabling the feature through configuration.

Some organizations have centralized schema storage where developers can define their table schemas. To adopt Parquet modular encryption, it would be natural for those developers to simply change their schema in the centralized schema storage, for example by appending a boolean configuration to a column to enable encryption for that column. The technical stack, including ingestion pipelines, compute engines, etc., would then use that schema to control column encryption inside Parquet files, without any further user involvement. Even where there is no centralized schema storage, it may still be desirable to control column encryption by just changing a setting in the schema, because it eases encryption management.

Goals

1. Provide an interface for activating the Parquet encryption proposed by PARQUET-1178 via the passed-in schema, wrapping key management and several other crypto settings into a plugin that implements this interface, in order to ease the adoption of Parquet modular encryption.

2. No changes to the specification (parquet-format), no new Parquet APIs, and no changes to existing Parquet APIs. All current applications, tests, etc., will continue to work.

3. If no plugin (an implementation of the proposed interface) is configured in the Hadoop configuration, all the changes discussed in this design are bypassed, and all existing behaviors work as before.

Technical approach

We propose a module, the Crypto Properties Retriever, which parses the schema passed in from the Parquet application, retrieves keys, key metadata, and the AAD (Additional Authenticated Data) prefix from an external service, and manages several other encryption settings. The encryption/decryption module added by PARQUET-1178 can get the needed encryption information from this retriever to perform encryption and decryption. This module will be released as a Parquet plugin, which can be enabled through the Hadoop configuration. This proposal defines an interface contracting the implementation of this plugin. The diagram below shows the relation between PARQUET-1396 and PARQUET-1178.

From a developer's perspective, they can simply implement the interface as a plugin that can be attached to any Parquet application like Hive/Spark. This decouples the complexity of dealing with the KMS and the schema from Parquet applications. A large organization may have hundreds or even thousands of Parquet applications and pipelines; this decoupling makes Parquet encryption easier to adopt.

From an end user's (for example, a data owner's) perspective, if they consider a column sensitive, they can just mark that column as sensitive in the schema, and the Parquet application will encrypt that column automatically. This makes it easy for end users to manage the encryption of their columns.

Thanks in advance for your time! Looking forward to your feedback!
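P.S. To make the retriever idea above more concrete, here is a minimal Java sketch. All names (CryptoPropertiesRetriever, SchemaFlagRetriever, the "encrypt" flag) are hypothetical illustrations of the schema-driven activation pattern, not the actual interface proposed in PARQUET-1396; column metadata is modeled as plain maps instead of real Parquet schema types.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of a pluggable retriever: the Parquet application
// hands over per-column schema metadata, and the plugin decides which
// columns must be encrypted (and, in a real implementation, would also
// fetch keys, key metadata, and the AAD prefix from a KMS).
interface CryptoPropertiesRetriever {
    // Return the names of columns the schema marks as sensitive.
    List<String> sensitiveColumns(Map<String, Map<String, String>> columnMetadata);
}

// Toy implementation: a column is sensitive when its schema metadata
// carries encrypt=true, mimicking "appending a boolean configuration
// to a column" in a centralized schema store.
class SchemaFlagRetriever implements CryptoPropertiesRetriever {
    @Override
    public List<String> sensitiveColumns(Map<String, Map<String, String>> columnMetadata) {
        return columnMetadata.entrySet().stream()
                .filter(e -> "true".equals(e.getValue().get("encrypt")))
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}

class RetrieverDemo {
    public static void main(String[] args) {
        Map<String, Map<String, String>> schema = Map.of(
                "ssn", Map.of("encrypt", "true"),
                "name", Map.of());
        CryptoPropertiesRetriever retriever = new SchemaFlagRetriever();
        System.out.println(retriever.sensitiveColumns(schema)); // prints [ssn]
    }
}
```

If no such plugin is configured, the application would simply skip this step and write plaintext files, matching goal 3 above.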
----------
Xinli Shang
Uber Big Data Team
