ggershinsky commented on pull request #2639:
URL: https://github.com/apache/iceberg/pull/2639#issuecomment-860627950


   >  Can you provide a quick summary of how to plug into Parquet encryption 
and what this does?
   
   Certainly. There are two encryption interfaces in parquet-mr 1.12.0: a low-level one (a direct implementation of the spec; maximum flexibility; no key management) and a high-level one (a layer on top of the low-level interface, with lib-local key management tools driven by Hadoop properties). In Iceberg, we'll use the low-level Parquet encryption API directly, for two reasons: key management will be done by Iceberg itself, in a similar fashion for all file formats; and Iceberg's centralized manifests make key management more efficient than running lib-local key tooling in every worker process.
   Since Iceberg already taps directly into the low-level (general, non-encryption) Parquet API, this PR links it to the encryption feature and translates a general column encryption configuration (TBD) into a Parquet encryption configuration.
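   Because the low-level API leaves key management entirely to the caller, the caller has to supply a raw AES key per encrypted column (plus one for the footer) of each file. A minimal JDK-only sketch of that responsibility, with illustrative names only (this is not the actual parquet-mr or Iceberg API):

```java
import java.security.SecureRandom;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a unique random DEK per encrypted column of one file,
// plus one for the footer, as the low-level Parquet encryption spec
// requires. Method and map-key names here are hypothetical.
public class DekPerColumn {
    private static final SecureRandom RNG = new SecureRandom();

    static byte[] newDek() {
        byte[] dek = new byte[16]; // AES-128 data encryption key
        RNG.nextBytes(dek);
        return dek;
    }

    // One fresh DEK per encrypted column, plus one for the footer.
    static Map<String, byte[]> deksForFile(List<String> encryptedColumns) {
        Map<String, byte[]> deks = new LinkedHashMap<>();
        deks.put("__footer__", newDek());
        for (String col : encryptedColumns) {
            deks.put(col, newDek());
        }
        return deks;
    }

    public static void main(String[] args) {
        Map<String, byte[]> deks = deksForFile(List.of("ssn", "email"));
        System.out.println(deks.size()); // footer + 2 columns -> 3
    }
}
```

   The open question, addressed below, is where these raw DEKs come from and how they are protected once the file is written.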
   
   > it provides Parquet's equivalent of an EncryptionManager that gets the 
file AAD and necessary key material from Iceberg's key_metadata field.
   
   Well, here we have the gap I described in the other PR. To encrypt data (and delete) files, we need the AES keys: the DEKs, "data encryption keys", which actually encrypt the data and metadata modules (there must be a unique DEK per file/column). But `key_metadata` is a binary field in the manifest entry of a data/delete file that keeps a "wrapped" version of these DEKs (encrypted with master keys, MEKs, in the user's KMS); it doesn't and shouldn't keep raw DEKs. Therefore, sending `key_metadata` to Parquet file writers (or any other writers) doesn't help. Per my proposal in the last sync, we can reverse this process: generate random DEKs at the (Parquet) writers, use them for encryption, and send them back to the manifest writer in the `DataFile`/`DeleteFile`/`ContentFile` objects. This also fits the current Iceberg model, where manifest entries are written after collecting `ContentFile` objects from the file writers. At that point, the manifest writer process will contact the KMS to wrap the DEKs and package them into the `key_metadata` field of the manifest file (for the readers).
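   The reverse flow can be sketched end to end with plain JDK crypto. Assumptions in this sketch: AES-GCM is used both for the data and for key wrapping, a locally generated MEK stands in for the KMS call, and the file path serves as the AAD; none of these names are the actual Iceberg or parquet-mr API.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

// JDK-only sketch of the "reverse" flow: writer generates the DEK and
// encrypts; the manifest writer wraps the DEK into key_metadata; the
// reader unwraps and decrypts. A local MEK simulates the KMS.
public class ReverseDekFlow {
    static final SecureRandom RNG = new SecureRandom();

    static byte[] aesGcm(int mode, byte[] key, byte[] iv, byte[] in, byte[] aad)
            throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(mode, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
        if (aad != null) c.updateAAD(aad);
        return c.doFinal(in);
    }

    public static void main(String[] args) throws Exception {
        // 1. Parquet writer: random DEK, encrypt the file contents.
        byte[] dek = new byte[16];
        RNG.nextBytes(dek);
        byte[] iv = new byte[12];
        RNG.nextBytes(iv);
        byte[] aad = "s3://bucket/data/file-0001.parquet".getBytes(StandardCharsets.UTF_8);
        byte[] ciphertext = aesGcm(Cipher.ENCRYPT_MODE, dek, iv,
                "sensitive row data".getBytes(StandardCharsets.UTF_8), aad);

        // 2. The writer returns the raw DEK inside its ContentFile object.

        // 3. Manifest writer: wrap the DEK with a master key. In production
        //    this is a KMS call; here a local MEK stands in for it.
        byte[] mek = new byte[16];
        RNG.nextBytes(mek);
        byte[] wrapIv = new byte[12];
        RNG.nextBytes(wrapIv);
        byte[] keyMetadata = aesGcm(Cipher.ENCRYPT_MODE, mek, wrapIv, dek, null);

        // 4. Reader: unwrap the DEK from key_metadata, then decrypt the data.
        byte[] unwrapped = aesGcm(Cipher.DECRYPT_MODE, mek, wrapIv, keyMetadata, null);
        byte[] plain = aesGcm(Cipher.DECRYPT_MODE, unwrapped, iv, ciphertext, aad);

        System.out.println(Arrays.equals(dek, unwrapped)); // true
        System.out.println(new String(plain, StandardCharsets.UTF_8));
    }
}
```

   Note the raw DEK only ever travels from the file writer to the manifest writer; what lands in the manifest is the wrapped form, so readers must go through the KMS (holding the MEK) to recover it.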
   
   In a future version, we might want to generate the DEKs (or fetch them from the KMS) in the manifest writer process and then distribute them to the data/delete file writers, with a unique DEK per file (or a set of unique DEKs per file, for column encryption). That is more complicated and a worse fit for the current Iceberg flow, so my suggestion is to start with the reverse approach described above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
