ggershinsky commented on pull request #2639: URL: https://github.com/apache/iceberg/pull/2639#issuecomment-860627950
> Can you provide a quick summary of how to plug into Parquet encryption and what this does?

Certainly. There are two encryption interfaces in parquet-mr 1.12.0: low-level (a direct implementation of the spec; maximum flexibility; no key management) and high-level (a layer on top of the low-level one, with library-local key management tools driven by Hadoop properties). In Iceberg, we'll use the low-level Parquet encryption API directly, because key management will be done by Iceberg in a similar fashion for all formats, and because Iceberg has a centralized manifest capability, which makes key management more efficient than running library-local key management in each worker process. Since Iceberg already taps directly into the general (non-encryption) low-level Parquet API, this PR links it to the encryption feature and translates the general column encryption configuration (TBD) into Parquet encryption configuration.

> it provides Parquet's equivalent of an EncryptionManager that gets the file AAD and necessary key material from Iceberg's key_metadata field.

Well, here we have the gap that I've described in the other PR. To encrypt data(/delete) files, we need the AES keys: the DEKs, "data encryption keys", which are used to actually encrypt the data and metadata modules (there must be a unique DEK per file/column). But `key_metadata` is a binary field in the manifest entry for a data(/delete) file that keeps a "wrapped" version of these DEKs (encrypted with master keys, MEKs, in the user's KMS). It doesn't and shouldn't keep raw DEKs. Therefore, sending `key_metadata` to Parquet file writers (or any other writers) doesn't help. Per my proposal in the last sync, we can reverse this process: generate random DEKs at the (Parquet) writers, use them for encryption, and send them in the `DataFile`/`DeleteFile`/`ContentFile` objects back to the manifest writer.
This also seems to fit the current Iceberg model, where manifest entries are written after collecting `ContentFile` objects from file writers. At this point, the manifest writer process will contact the KMS to wrap these DEKs and package them into the `key_metadata` field in the manifest file (for the readers).

In a future version, we might want to generate the DEKs (or get them from the KMS) in the manifest writer process, and then distribute them to data/delete file writers, with a unique DEK per file (or a set of unique DEKs per file, for column encryption). This seems more complicated and a poorer fit for the current Iceberg flow; my suggestion would be to start with the reverse approach described above.
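To make the reverse flow concrete, here is a minimal JDK-only sketch of it. It is not Iceberg or parquet-mr code: the class and method names (`ReverseDekFlowSketch`, `generateDek`, `wrapDek`, `unwrapDek`) are hypothetical, the KMS call is replaced by a local AES-GCM wrap with an in-memory stand-in for the MEK, and the `nonce || ciphertext` byte layout is an assumption made for illustration, not the actual `key_metadata` format.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: writer generates a DEK, manifest writer wraps it.
final class ReverseDekFlowSketch {
  private static final SecureRandom RANDOM = new SecureRandom();
  private static final int NONCE_LENGTH = 12; // bytes, usual GCM nonce size
  private static final int TAG_BITS = 128;    // GCM authentication tag size

  // File-writer side: a fresh random 128-bit DEK per file (or per column).
  static byte[] generateDek() {
    byte[] dek = new byte[16];
    RANDOM.nextBytes(dek);
    return dek;
  }

  // Manifest-writer side: local stand-in for a KMS "wrap" call.
  // Returns nonce || ciphertext (assumed layout, for illustration only).
  static byte[] wrapDek(byte[] mek, byte[] dek) {
    try {
      byte[] nonce = new byte[NONCE_LENGTH];
      RANDOM.nextBytes(nonce);
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(mek, "AES"),
          new GCMParameterSpec(TAG_BITS, nonce));
      byte[] ciphertext = cipher.doFinal(dek);
      byte[] wrapped = Arrays.copyOf(nonce, NONCE_LENGTH + ciphertext.length);
      System.arraycopy(ciphertext, 0, wrapped, NONCE_LENGTH, ciphertext.length);
      return wrapped;
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Failed to wrap DEK", e);
    }
  }

  // Reader side: local stand-in for a KMS "unwrap" call.
  static byte[] unwrapDek(byte[] mek, byte[] wrapped) {
    try {
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(mek, "AES"),
          new GCMParameterSpec(TAG_BITS, wrapped, 0, NONCE_LENGTH));
      return cipher.doFinal(wrapped, NONCE_LENGTH, wrapped.length - NONCE_LENGTH);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Failed to unwrap DEK", e);
    }
  }

  public static void main(String[] args) {
    byte[] mek = generateDek();          // stand-in for a master key held in the KMS
    byte[] dek = generateDek();          // 1. writer generates the DEK and encrypts with it
    byte[] wrapped = wrapDek(mek, dek);  // 2. manifest writer wraps it for the manifest entry
    byte[] recovered = unwrapDek(mek, wrapped); // 3. reader unwraps before decrypting
    System.out.println("round trip ok: " + Arrays.equals(dek, recovered));
  }
}
```

In this sketch only the wrapped bytes would be persisted; the raw DEK exists just long enough to travel from the file writer to the manifest writer, which matches the point that the manifest should not keep raw DEKs.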
