Gidon Gershinsky created SPARK-33966:
----------------------------------------
Summary: Two-tier encryption key management
Key: SPARK-33966
URL: https://issues.apache.org/jira/browse/SPARK-33966
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 3.1.0
Reporter: Gidon Gershinsky
Columnar data formats (Parquet and ORC) have recently added a column encryption
capability. The data protection follows the practice of envelope encryption:
a Data Encryption Key (DEK) is freshly generated for each file/column and is
encrypted with a master key (or with an intermediate key that is in turn
encrypted with a master key). The master keys are kept in a centralized Key
Management Service (KMS), which means that each Spark worker needs to interact
with a (typically slow) KMS server.
This Jira (and its sub-tasks) introduces an alternative approach that, on the
one hand, preserves the best practice of generating fresh encryption keys for
each data file/column and, on the other hand, lets Spark clusters interact with
a KMS server in a scalable way by delegating that interaction to the
application driver. This is done via two-tier management of the keys: a random
Key Encryption Key (KEK) is generated by the driver, encrypted with the master
key in the KMS, and distributed by the driver to the workers, which use it to
encrypt the DEKs generated there by the Parquet or ORC libraries. In the write
path, the KEKs are distributed within each worker to its executors/threads. In
the read path, the workers fetch the encrypted KEKs from file metadata, decrypt
them via interaction with the driver, and share them among the
executors/threads.
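The write and read paths described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the proposed
implementation: ToyKms, the driver_*/worker_* functions and the metadata field
names are hypothetical, and wrap/unwrap are a toy HMAC-based stand-in for the
real cipher used by the Parquet/ORC libraries, chosen only so that the sketch
runs with the Python standard library.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from key and nonce (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(4, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def wrap(wrapping_key: bytes, key_to_wrap: bytes) -> bytes:
    """Toy key wrapping (NOT production crypto): nonce || ciphertext || tag."""
    nonce = os.urandom(12)
    stream = _keystream(wrapping_key, nonce, len(key_to_wrap))
    ciphertext = bytes(a ^ b for a, b in zip(key_to_wrap, stream))
    tag = hmac.new(wrapping_key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def unwrap(wrapping_key: bytes, wrapped: bytes) -> bytes:
    """Verify the tag, then recover the wrapped key."""
    nonce, ciphertext, tag = wrapped[:12], wrapped[12:-32], wrapped[-32:]
    expected = hmac.new(wrapping_key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or corrupted key material")
    stream = _keystream(wrapping_key, nonce, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))


class ToyKms:
    """Hypothetical in-memory KMS; master keys never leave it. Counts calls."""

    def __init__(self, master_key_ids):
        self._master_keys = {key_id: os.urandom(32) for key_id in master_key_ids}
        self.calls = 0

    def wrap(self, master_key_id: str, key: bytes) -> bytes:
        self.calls += 1
        return wrap(self._master_keys[master_key_id], key)

    def unwrap(self, master_key_id: str, wrapped: bytes) -> bytes:
        self.calls += 1
        return unwrap(self._master_keys[master_key_id], wrapped)


def driver_prepare_write(kms: ToyKms, master_key_id: str):
    """Driver: generate a random KEK and have the KMS encrypt it (one KMS call)."""
    kek = os.urandom(16)
    return kek, kms.wrap(master_key_id, kek)


def worker_write_column(kek: bytes, wrapped_kek: bytes):
    """Worker: fresh DEK per file/column, encrypted locally with the KEK."""
    dek = os.urandom(16)
    metadata = {"wrapped_kek": wrapped_kek, "wrapped_dek": wrap(kek, dek)}
    return dek, metadata


def driver_unwrap_kek(kms: ToyKms, master_key_id: str, wrapped_kek: bytes) -> bytes:
    """Driver: decrypt a KEK on behalf of a worker (one KMS call)."""
    return kms.unwrap(master_key_id, wrapped_kek)


def worker_read_column(metadata: dict, kek: bytes) -> bytes:
    """Worker: recover the DEK locally, without contacting the KMS."""
    return unwrap(kek, metadata["wrapped_dek"])
```

In this sketch a worker can write any number of columns with one KEK, and a
reader needs a single driver round trip per distinct wrapped KEK it finds in
file metadata; every DEK is then unwrapped locally.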
The KEK layer further improves scalability of the key management, because
neither the driver nor the workers need to interact with the KMS for each
file/column.
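To put rough numbers on this, here is a back-of-the-envelope sketch. The
function names and the job sizes are hypothetical, and it assumes one KMS call
per key wrapped by the KMS, no caching of KMS responses, and a single KEK per
master key for the lifetime of the job.

```python
def kms_calls_direct(num_files: int, encrypted_columns_per_file: int) -> int:
    """Every DEK is wrapped by the KMS itself: one call per file/column.

    Assumption: no caching of KMS responses on the workers.
    """
    return num_files * encrypted_columns_per_file


def kms_calls_two_tier(num_master_keys: int) -> int:
    """The driver has the KMS wrap one KEK per master key; DEKs are wrapped locally.

    Assumption: a single KEK per master key for the lifetime of the job.
    """
    return num_master_keys


# Hypothetical write job: 10,000 output files, 5 encrypted columns, 2 master keys.
direct = kms_calls_direct(10_000, 5)
two_tier = kms_calls_two_tier(2)
print(direct, two_tier)  # 50000 2
```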
Stand-alone Parquet/ORC libraries (without Spark) and/or other frameworks
(e.g., Presto, pandas) must be able to read/decrypt the files written/encrypted
by this Spark-driven key management mechanism, and vice versa (provided, of
course, that both sides have proper authorisation for using the master keys in
the KMS).
A link to a discussion/design doc is attached.