Gidon Gershinsky created SPARK-33966:
----------------------------------------

             Summary: Two-tier encryption key management
                 Key: SPARK-33966
                 URL: https://issues.apache.org/jira/browse/SPARK-33966
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Gidon Gershinsky


Columnar data formats (Parquet and ORC) have recently added a column encryption
capability. The data protection follows the practice of envelope encryption:
a Data Encryption Key (DEK) is freshly generated for each file/column and is
encrypted with a master key (or with an intermediate key, which is in turn
encrypted with a master key). The master keys are kept in a centralized Key
Management Service (KMS), which means that each Spark worker needs to interact
with a (typically slow) KMS server.
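
To make the baseline concrete, here is a minimal sketch of plain envelope
encryption, in which every file costs one KMS round trip. Everything in it is
an assumption made for illustration: ToyKms, write_file_keys and the metadata
field names are hypothetical, and toy_wrap is an insecure XOR stand-in for a
real key-wrapping cipher, not what the Parquet or ORC libraries do.

```python
import hashlib
import secrets


def toy_wrap(wrapping_key: bytes, payload: bytes) -> bytes:
    # NOT secure: XOR with a SHA-256-derived pad, standing in for real key
    # wrapping. XOR is its own inverse, so the same call also unwraps.
    pad = hashlib.sha256(wrapping_key).digest()
    return bytes(p ^ k for p, k in zip(payload, pad))


class ToyKms:
    """Hypothetical in-memory KMS; master keys never leave this object."""

    def __init__(self):
        self._master_keys = {}
        self.calls = 0  # counts round trips to the (typically slow) KMS

    def create_master_key(self, key_id: str) -> None:
        self._master_keys[key_id] = secrets.token_bytes(32)

    def wrap(self, key_id: str, key_bytes: bytes) -> bytes:
        self.calls += 1
        return toy_wrap(self._master_keys[key_id], key_bytes)

    def unwrap(self, key_id: str, wrapped: bytes) -> bytes:
        self.calls += 1
        return toy_wrap(self._master_keys[key_id], wrapped)


def write_file_keys(kms: ToyKms, master_key_id: str):
    # A fresh DEK per file/column, wrapped directly by the master key.
    dek = secrets.token_bytes(16)
    metadata = {
        "master_key_id": master_key_id,
        "wrapped_dek": kms.wrap(master_key_id, dek),  # one KMS call per file
    }
    return dek, metadata
```

In this sketch, writing N files from the workers results in N calls to the
KMS, which is the scalability problem described above.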

This Jira (and its sub-tasks) introduces an alternative approach that
preserves the best practice of generating fresh encryption keys for each
data file/column, while allowing Spark clusters to interact with a KMS server
in a scalable way, by delegating that interaction to the application driver.
This is done via two-tier management of the keys: a random Key Encryption Key
(KEK) is generated by the driver, encrypted by the master key in the KMS, and
distributed by the driver to the workers, which use it to encrypt the DEKs
generated there by the Parquet or ORC libraries. In the write path, the KEKs
are distributed within each worker to its executors/threads. In the read
path, the encrypted KEKs are fetched by the workers from file metadata,
decrypted via interaction with the driver, and shared among the
executors/threads.
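
The write and read paths described above can be sketched as follows. This is
an illustration under stated assumptions, not the design from the attached
doc: the Driver, Worker and ToyKms classes, the metadata field names and the
caching in the driver are all hypothetical, and toy_wrap is an insecure XOR
stand-in for real key wrapping.

```python
import hashlib
import secrets


def toy_wrap(wrapping_key: bytes, payload: bytes) -> bytes:
    # NOT secure: XOR with a SHA-256-derived pad, standing in for real key
    # wrapping. XOR is its own inverse, so the same call also unwraps.
    pad = hashlib.sha256(wrapping_key).digest()
    return bytes(p ^ k for p, k in zip(payload, pad))


class ToyKms:
    """Hypothetical in-memory KMS that counts how often it is called."""

    def __init__(self, master_key_id: str):
        self._master_keys = {master_key_id: secrets.token_bytes(32)}
        self.calls = 0

    def wrap(self, master_key_id: str, key_bytes: bytes) -> bytes:
        self.calls += 1
        return toy_wrap(self._master_keys[master_key_id], key_bytes)

    def unwrap(self, master_key_id: str, wrapped: bytes) -> bytes:
        self.calls += 1
        return toy_wrap(self._master_keys[master_key_id], wrapped)


class Driver:
    """Only the driver talks to the KMS."""

    def __init__(self, kms: ToyKms, master_key_id: str):
        self.kms = kms
        self.master_key_id = master_key_id
        self._unwrapped = {}  # wrapped KEK -> KEK (this cache is an assumption)

    def new_kek(self):
        # Write path: generate a random KEK and wrap it with the master key.
        kek = secrets.token_bytes(16)
        return kek, self.kms.wrap(self.master_key_id, kek)

    def unwrap_kek(self, master_key_id: str, wrapped_kek: bytes) -> bytes:
        # Read path: workers ask the driver; the KMS is hit once per KEK.
        if wrapped_kek not in self._unwrapped:
            self._unwrapped[wrapped_kek] = self.kms.unwrap(master_key_id, wrapped_kek)
        return self._unwrapped[wrapped_kek]


class Worker:
    def __init__(self, driver: Driver):
        self.driver = driver
        self.kek_cache = {}  # shared by this worker's executors/threads

    def write_file(self, master_key_id: str, kek: bytes, wrapped_kek: bytes):
        dek = secrets.token_bytes(16)  # in Spark, made by the Parquet/ORC library
        metadata = {
            "master_key_id": master_key_id,
            "wrapped_kek": wrapped_kek,
            "wrapped_dek": toy_wrap(kek, dek),  # wrapped locally, no KMS call
        }
        return dek, metadata

    def read_dek(self, metadata: dict) -> bytes:
        wrapped_kek = metadata["wrapped_kek"]
        if wrapped_kek not in self.kek_cache:
            self.kek_cache[wrapped_kek] = self.driver.unwrap_kek(
                metadata["master_key_id"], wrapped_kek)
        return toy_wrap(self.kek_cache[wrapped_kek], metadata["wrapped_dek"])


kms = ToyKms("mk1")
write_driver = Driver(kms, "mk1")
kek, wrapped_kek = write_driver.new_kek()
writers = [Worker(write_driver) for _ in range(4)]
files = [writers[i % 4].write_file("mk1", kek, wrapped_kek) for i in range(100)]
calls_after_write = kms.calls

read_driver = Driver(kms, "mk1")  # a separate application reading the files
readers = [Worker(read_driver) for _ in range(4)]
recovered = [readers[i % 4].read_dek(meta) for i, (_, meta) in enumerate(files)]
calls_after_read = kms.calls
```

With 100 files spread over four workers, this toy run makes one KMS call on
the write side and one more on the read side, while each file still gets its
own DEK.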

The KEK layer further improves the scalability of the key management, because
neither the driver nor the workers need to interact with the KMS for each
file/column.

Stand-alone Parquet/ORC libraries (without Spark) and/or other frameworks
(e.g., Presto, pandas) must be able to read/decrypt the files
written/encrypted by this Spark-driven key management mechanism, and
vice versa (provided, of course, that both sides have proper authorisation
for using the master keys in the KMS).
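
One way this interoperability could work is for each file to carry enough key
material for a reader with KMS access to recover the DEK without a Spark
driver. The sketch below assumes that the file metadata holds the master key
ID, the wrapped KEK and the wrapped DEK; those field names, make_kms_unwrap
and standalone_read_dek are hypothetical, and toy_wrap is an insecure XOR
stand-in for real key wrapping.

```python
import hashlib
import secrets


def toy_wrap(wrapping_key: bytes, payload: bytes) -> bytes:
    # NOT secure: XOR with a SHA-256-derived pad, standing in for real key
    # wrapping. XOR is its own inverse, so the same call also unwraps.
    pad = hashlib.sha256(wrapping_key).digest()
    return bytes(p ^ k for p, k in zip(payload, pad))


def make_kms_unwrap(master_keys: dict, authorised_key_ids: set):
    """Hypothetical KMS client bound to one caller's permissions."""
    def kms_unwrap(master_key_id: str, wrapped: bytes) -> bytes:
        if master_key_id not in authorised_key_ids:
            raise PermissionError(f"no access to master key {master_key_id!r}")
        return toy_wrap(master_keys[master_key_id], wrapped)
    return kms_unwrap


def standalone_read_dek(kms_unwrap, metadata: dict) -> bytes:
    # No driver involved: the reader unwraps the KEK at the KMS itself,
    # then unwraps the DEK locally.
    kek = kms_unwrap(metadata["master_key_id"], metadata["wrapped_kek"])
    return toy_wrap(kek, metadata["wrapped_dek"])


# Metadata as a two-tier writer might have produced it (assumed layout).
master_keys = {"mk1": secrets.token_bytes(32)}
kek, dek = secrets.token_bytes(16), secrets.token_bytes(16)
metadata = {
    "master_key_id": "mk1",
    "wrapped_kek": toy_wrap(master_keys["mk1"], kek),
    "wrapped_dek": toy_wrap(kek, dek),
}
```

A reader authorised for "mk1" recovers the original DEK from the metadata
alone, while a reader without that authorisation gets a PermissionError from
the toy KMS client.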

A link to a discussion/design doc is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
