Dear all,

As Parquet-1178 <https://issues.apache.org/jira/browse/PARQUET-1178> passed
the vote, I would like to bring up Parquet-1396
<https://issues.apache.org/jira/browse/PARQUET-1396> (Crypto Interface for
Schema Activation of Parquet Encryption
<https://docs.google.com/document/d/17GTQAezl1ZC1pMNHjYU_bPVxMU6DIPjtXOiLclXUlyA>)
for a design review. The motivation of Parquet-1396
<https://issues.apache.org/jira/browse/PARQUET-1396> is to make Parquet-1178
<https://issues.apache.org/jira/browse/PARQUET-1178> easier to integrate
into existing applications and services. We will provide a sample
implementation of this interface. Parquet-1396
<https://issues.apache.org/jira/browse/PARQUET-1396> is summarized below;
for details, please see the design document, Crypto Interface for Schema
Activation of Parquet Encryption
<https://docs.google.com/document/d/17GTQAezl1ZC1pMNHjYU_bPVxMU6DIPjtXOiLclXUlyA>.


Problem statement

Parquet-1178 <https://issues.apache.org/jira/browse/PARQUET-1178> provides
column encryption inside Parquet files, but to adopt it, Parquet
applications like Spark and Hive need to change their code by 1) calling
the new crypto API; 2) determining column sensitivity and building the
corresponding column encryption properties; and 3) handling key
management, which most likely involves interacting with a remote KMS (Key
Management Service). In reality, especially in a large organization, the
Parquet library is used by many different applications and teams, so
making a relatively significant code change to every application could
make adoption of the encryption feature harder and slower. In such cases,
people usually prefer minimal changes, such as upgrading the Parquet
library and adjusting configuration to enable the feature.

Some organizations have centralized schema storage where developers define
their table schemas. To adopt Parquet modular encryption, a natural choice
for those developers would be to simply change their schema in the
centralized schema storage, for example by adding a boolean flag to a
column to enable encryption for that column. The technical stack,
including ingestion pipelines, compute engines, etc., would then use that
schema to control column encryption inside Parquet files, without any
further user involvement. Even where there is no centralized schema
storage, it may still be desirable to control column encryption by just
changing a setting in the schema, because doing so eases encryption
management.
Goals

   1. Provide an interface for activating the Parquet encryption proposed
   by Parquet-1178 <https://issues.apache.org/jira/browse/PARQUET-1178>
   from the passed-in schema, wrapping key management and several other
   crypto settings into a plugin that implements this interface, in order
   to ease the adoption of Parquet modular encryption.

   2. No change to the specification (parquet-format), no new Parquet
   APIs, and no changes to existing Parquet APIs. All current
   applications, tests, etc., will keep working.

   3. If no plugin (an implementation of the proposed interface) is
   configured in the Hadoop configuration, all the changes discussed in
   this design are bypassed, and all existing behavior works as before.
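As an illustration of point 3, enabling such a plugin could come down to a single Hadoop configuration entry; the property name and class below are purely hypothetical, not the actual configuration keys proposed in the design:

```xml
<!-- Illustrative only: both the property name and the class are
     hypothetical examples, not part of the Parquet-1396 proposal. -->
<property>
  <name>parquet.crypto.properties.retriever.class</name>
  <value>com.example.MyCompanyCryptoRetriever</value>
</property>
```

If the property is absent, Parquet would behave exactly as it does today.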

Technical approach

We propose a module, the Crypto Properties Retriever, which parses the
schema passed in from the Parquet application, retrieves keys, key
metadata, and the AAD (Additional Authenticated Data) prefix from an
external service, and manages several other encryption settings. The
encryption/decryption module added by Parquet-1178
<https://issues.apache.org/jira/browse/PARQUET-1178> gets the needed
encryption information from this retriever to perform encryption and
decryption. This module will be released as a Parquet plugin, which can be
enabled through the Hadoop configuration. This proposal defines an
interface that implementations of this plugin must follow. The diagram
below shows the relation between Parquet-1396
<https://issues.apache.org/jira/browse/PARQUET-1396> and Parquet-1178
<https://issues.apache.org/jira/browse/PARQUET-1178>.

[diagram: relation between Parquet-1396 and Parquet-1178]
From a developer's perspective, they can simply implement the interface in
a plugin, which can then be attached to any Parquet application like Hive
or Spark. This decouples the complexity of dealing with the KMS and the
schema from the Parquet applications. A large organization may have
hundreds or even thousands of Parquet applications and pipelines; this
decoupling makes Parquet encryption much easier to adopt.
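To make the plugin contract concrete, here is a rough sketch of what such a retriever might look like. All names (CryptoPropertiesRetriever, ColumnCryptoInfo, the "sensitive" marker) are illustrative assumptions, not the actual API in the Parquet-1396 design document, and the schema is simplified to a column-name-to-metadata map:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only: the interface and class names below are
 * hypothetical and do not come from the Parquet-1396 design document.
 */
public class RetrieverSketch {

    /** Per-column encryption information the retriever hands back. */
    static final class ColumnCryptoInfo {
        final boolean encrypted;
        final String keyMetadata; // e.g. an opaque reference to a key held in a KMS
        ColumnCryptoInfo(boolean encrypted, String keyMetadata) {
            this.encrypted = encrypted;
            this.keyMetadata = keyMetadata;
        }
    }

    /** The plugin contract: given the schema, decide what gets encrypted. */
    interface CryptoPropertiesRetriever {
        Map<String, ColumnCryptoInfo> getColumnCryptoInfo(Map<String, String> columnMetadata);
    }

    /** Toy implementation: a column is encrypted if its schema metadata says "sensitive". */
    static final class MetadataFlagRetriever implements CryptoPropertiesRetriever {
        @Override
        public Map<String, ColumnCryptoInfo> getColumnCryptoInfo(Map<String, String> columnMetadata) {
            Map<String, ColumnCryptoInfo> result = new HashMap<>();
            for (Map.Entry<String, String> e : columnMetadata.entrySet()) {
                boolean sensitive = "sensitive".equals(e.getValue());
                result.put(e.getKey(),
                    new ColumnCryptoInfo(sensitive, sensitive ? "kms:key-for-" + e.getKey() : null));
            }
            return result;
        }
    }

    public static void main(String[] args) {
        Map<String, String> schema = new HashMap<>();
        schema.put("ssn", "sensitive");
        schema.put("name", "plain");
        Map<String, ColumnCryptoInfo> info = new MetadataFlagRetriever().getColumnCryptoInfo(schema);
        System.out.println(info.get("ssn").encrypted);  // true
        System.out.println(info.get("name").encrypted); // false
    }
}
```

The file writer/reader paths added by Parquet-1178 would consult the configured retriever instead of hand-built per-column properties; applications never touch the KMS directly.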

From an end user's (for example, a data owner's) perspective, if they
consider a column sensitive, they can simply mark that column as sensitive
in its schema, and the Parquet application will then encrypt that column
automatically. This makes it easy for end users to manage the encryption
of their columns.
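As an illustration, marking a column sensitive in a centralized schema store might amount to one extra field; the schema format and the "encrypt" flag name below are hypothetical, not prescribed by this design:

```json
{
  "table": "user_events",
  "fields": [
    {"name": "ssn",        "type": "string", "encrypt": true},
    {"name": "event_time", "type": "long"}
  ]
}
```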


Thanks in advance for your time! Looking forward to your feedback!

----------

Xinli Shang

Uber Big Data Team
