mccheah commented on issue #20: Encryption in Data Files
URL: https://github.com/apache/incubator-iceberg/issues/20#issuecomment-443371232

I think we can build a top-level interface that, when implemented with KMS, could accomplish those goals, but could also be implemented by other means to support other key storage strategies. More precisely, I _really_ don't want to tie us to KMS as the key storage backend for all Iceberg users. Here's a rough sketch of a design that I think could have that flexibility:

- Define an `EncryptionMetadata` struct that is an optional column in the `DataFile` schema. The struct has the following shape:

  ```
  struct EncryptionMetadata {
    String keyName();
    long keyVersion();
    String encryptedKeyAlgorithm();
    byte[] iv();
    byte[] encryptedKeyBytes();
  }
  ```

- Define the structure `EncryptionKey` as follows; it's simply the real key (more or less just this [KeyMaterial](https://github.com/palantir/hadoop-crypto/blob/ac8d4474ee667121873bf0abf0674d83c78d8b90/crypto-keys/src/main/java/com/palantir/crypto2/keys/KeyMaterial.java#L25)):

  ```
  struct EncryptionKey {
    byte[] keyBytes();
    byte[] iv();
  }
  ```

- Define a `KeyManager` that handles both resolution and generation of these keys:

  ```
  struct GeneratedEncryptionKey {
    EncryptionMetadata keyMetadata();
    EncryptionKey key();
  }

  interface KeyManager {
    EncryptionKey resolveEncryptionKey(EncryptionMetadata metadata);

    // Optional; can default to calling resolveEncryptionKey(...) multiple times.
    List<EncryptionKey> resolveEncryptionKeyBatch(List<EncryptionMetadata> metadatas);

    GeneratedEncryptionKey generateEncryptionKey(String filePath);

    // Optional; can default to calling generateEncryptionKey(...) multiple times.
    List<GeneratedEncryptionKey> generateEncryptionKeyBatch(List<String> filePaths);
  }
  ```

- Note that `GeneratedEncryptionKey` pairs the decrypted and encrypted forms of the key because we need both: the decrypted key to actually encrypt the output stream wherever we're doing the encrypted write itself, and the encrypted key to store in the manifest file after the fact.

Now suppose we were to implement all of the above with KMS as the storage backend. I'd suppose we could provide this as the default implementation that ships with Iceberg:

- `KeyManager#resolveEncryptionKey` goes to the KMS and asks it to decrypt the key using the given metadata's name and encrypted bytes. KMS uses the master key and passes back the decrypted key; `KeyManager` converts that into an `EncryptionKey` object and returns it.
- `KeyManager#generateEncryptionKey` generates a local key, tells KMS to encrypt it, and gets back the encrypted key. It returns the pair of decrypted and encrypted keys in a `GeneratedEncryptionKey` struct.

Regarding hadoop-crypto: I was considering it less for the [key storage](https://github.com/palantir/hadoop-crypto/blob/ac8d4474ee667121873bf0abf0674d83c78d8b90/crypto-keys/src/main/java/com/palantir/crypto2/keys/KeyStorageStrategy.java) layer and more for its [in-memory representation of keys](https://github.com/palantir/hadoop-crypto/blob/ac8d4474ee667121873bf0abf0674d83c78d8b90/crypto-keys/src/main/java/com/palantir/crypto2/keys/KeyMaterial.java) and its [generation of ciphers from those key objects](https://github.com/palantir/hadoop-crypto/blob/6d9e05a1e667f150f7d98435e93a0dd6f3ea5c08/crypto-core/src/main/java/com/palantir/crypto2/io/CryptoStreamFactory.java). Naturally, though, the interface shouldn't expose hadoop-crypto objects.

Finally, to cap it all off, I'd imagine `KeyManager` instances can be provided by implementations of `TableOperations`, and must thus be serializable so they can be sent to e.g. Spark executors.
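To make the generate/resolve round trip concrete, here is a minimal sketch of the proposed structs plus a `KeyManager`-style implementation. The class `InMemoryKeyManager`, its factory `withRandomMasterKey()`, and the AES/CBC wrapping scheme are all assumptions for illustration; a real KMS-backed implementation would never hold the master key in process, and the actual Iceberg interfaces may differ:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.security.SecureRandom;

// Plain holders mirroring the structs in the proposal above.
class EncryptionMetadata {
  private final String keyName;
  private final long keyVersion;
  private final String encryptedKeyAlgorithm;
  private final byte[] iv;
  private final byte[] encryptedKeyBytes;
  EncryptionMetadata(String keyName, long keyVersion, String encryptedKeyAlgorithm,
                     byte[] iv, byte[] encryptedKeyBytes) {
    this.keyName = keyName; this.keyVersion = keyVersion;
    this.encryptedKeyAlgorithm = encryptedKeyAlgorithm;
    this.iv = iv; this.encryptedKeyBytes = encryptedKeyBytes;
  }
  String keyName() { return keyName; }
  long keyVersion() { return keyVersion; }
  String encryptedKeyAlgorithm() { return encryptedKeyAlgorithm; }
  byte[] iv() { return iv; }
  byte[] encryptedKeyBytes() { return encryptedKeyBytes; }
}

class EncryptionKey {
  private final byte[] keyBytes;
  private final byte[] iv;
  EncryptionKey(byte[] keyBytes, byte[] iv) { this.keyBytes = keyBytes; this.iv = iv; }
  byte[] keyBytes() { return keyBytes; }
  byte[] iv() { return iv; }
}

class GeneratedEncryptionKey {
  private final EncryptionMetadata keyMetadata;
  private final EncryptionKey key;
  GeneratedEncryptionKey(EncryptionMetadata keyMetadata, EncryptionKey key) {
    this.keyMetadata = keyMetadata; this.key = key;
  }
  EncryptionMetadata keyMetadata() { return keyMetadata; }
  EncryptionKey key() { return key; }
}

// Hypothetical stand-in for a KMS-backed KeyManager: here the "master key"
// lives in memory, whereas a real KMS never releases it to the client.
class InMemoryKeyManager {
  private final SecretKey masterKey;
  private final SecureRandom random = new SecureRandom();

  InMemoryKeyManager(SecretKey masterKey) { this.masterKey = masterKey; }

  static InMemoryKeyManager withRandomMasterKey() {
    try {
      KeyGenerator gen = KeyGenerator.getInstance("AES");
      gen.init(128);
      return new InMemoryKeyManager(gen.generateKey());
    } catch (Exception e) { throw new RuntimeException(e); }
  }

  // Generate a fresh data key for filePath and wrap it with the master key,
  // returning both the plaintext key (for the writer to encrypt the stream)
  // and the wrapped bytes plus metadata (to store in the manifest).
  GeneratedEncryptionKey generateEncryptionKey(String filePath) {
    try {
      KeyGenerator gen = KeyGenerator.getInstance("AES");
      gen.init(128);
      SecretKey dataKey = gen.generateKey();
      byte[] iv = new byte[16];
      random.nextBytes(iv);
      Cipher wrap = Cipher.getInstance("AES/CBC/PKCS5Padding");
      wrap.init(Cipher.ENCRYPT_MODE, masterKey, new IvParameterSpec(iv));
      byte[] encryptedKeyBytes = wrap.doFinal(dataKey.getEncoded());
      EncryptionMetadata metadata = new EncryptionMetadata(
          "table-master-key", 1L, "AES/CBC/PKCS5Padding", iv, encryptedKeyBytes);
      return new GeneratedEncryptionKey(
          metadata, new EncryptionKey(dataKey.getEncoded(), iv));
    } catch (Exception e) { throw new RuntimeException(e); }
  }

  // Unwrap the encrypted key bytes from the manifest back into a usable key.
  EncryptionKey resolveEncryptionKey(EncryptionMetadata metadata) {
    try {
      Cipher unwrap = Cipher.getInstance(metadata.encryptedKeyAlgorithm());
      unwrap.init(Cipher.DECRYPT_MODE, masterKey, new IvParameterSpec(metadata.iv()));
      return new EncryptionKey(unwrap.doFinal(metadata.encryptedKeyBytes()), metadata.iv());
    } catch (Exception e) { throw new RuntimeException(e); }
  }
}
```

A reader should get back exactly the data key the writer used: resolving the metadata stored in the manifest must yield the same plaintext key bytes that `generateEncryptionKey` handed to the writer.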
Thoughts on the above proposal?