mccheah commented on issue #20: Encryption in Data Files
URL: https://github.com/apache/incubator-iceberg/issues/20#issuecomment-443371232
 
 
   I think we can build a top-level interface that, when implemented with KMS, could accomplish those goals, but could also be implemented by other means to support other key storage strategies. More to the point, I _really_ don't want to tie us to KMS as the key storage backend for all Iceberg users. Here's a rough sketch of a design that I think could have that flexibility:
   
   - Define an `EncryptionMetadata` struct that is an optional column in the `DataFile` 
schema. The struct has the following schema:
   
   ```
   struct EncryptionMetadata {
       String keyName();
       long keyVersion();
       String encryptedKeyAlgorithm();
       byte[] iv();
       byte[] encryptedKeyBytes();
   }
   ```
   - Define the structure `EncryptionKey` as follows; it's simply the real key material (more or less just this 
[KeyMaterial](https://github.com/palantir/hadoop-crypto/blob/ac8d4474ee667121873bf0abf0674d83c78d8b90/crypto-keys/src/main/java/com/palantir/crypto2/keys/KeyMaterial.java#L25)):
   
   ```
   struct EncryptionKey {
       byte[] keyBytes();
       byte[] iv();
   }
   ```
   
   - Define a `KeyManager` that handles both resolution and generation of these keys.
   
   ```
   struct GeneratedEncryptionKey {
       EncryptionMetadata keyMetadata();
       EncryptionKey key();
   }
   
   interface KeyManager {
       EncryptionKey resolveEncryptionKey(EncryptionMetadata metadata);

       // Optional, can just default to calling the individual resolve(...) multiple times.
       List<EncryptionKey> resolveEncryptionKeyBatch(List<EncryptionMetadata> metadatas);

       GeneratedEncryptionKey generateEncryptionKey(String filePath);

       // Optional, can just default to calling the individual generate(...) multiple times.
       List<GeneratedEncryptionKey> generateEncryptionKeyBatch(List<String> filePaths);
   }
   ```
   
   - Note that `GeneratedEncryptionKey` has to carry both forms of the key: the decrypted key is needed to actually encrypt the output stream wherever we're doing the encrypted write itself, while the encrypted key is what gets stored in the manifest file after the fact. A rough sketch of how the write path might use this follows right after this list.
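   
   To make that concrete, here's a minimal sketch of what the write path could look like against these interfaces, assuming AES/CTR. All of the glue below (the writer class, its method names, the algorithm choice) is hypothetical and just illustrates that the decrypted half drives the cipher while the metadata half ends up in the manifest:
   
   ```
   // Hypothetical write-path glue; assumes AES/CTR and the interfaces sketched above.
   import java.io.OutputStream;
   import javax.crypto.Cipher;
   import javax.crypto.CipherOutputStream;
   import javax.crypto.spec.IvParameterSpec;
   import javax.crypto.spec.SecretKeySpec;
   
   class EncryptingWriter {
     private final KeyManager keyManager;
   
     EncryptingWriter(KeyManager keyManager) {
       this.keyManager = keyManager;
     }
   
     // Encrypts `contents` onto `rawOut` and returns the metadata to record in the DataFile.
     EncryptionMetadata writeEncrypted(String filePath, OutputStream rawOut, byte[] contents) throws Exception {
       GeneratedEncryptionKey generated = keyManager.generateEncryptionKey(filePath);
   
       // The decrypted key material encrypts the actual bytes...
       Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
       cipher.init(
           Cipher.ENCRYPT_MODE,
           new SecretKeySpec(generated.key().keyBytes(), "AES"),
           new IvParameterSpec(generated.key().iv()));
       try (OutputStream out = new CipherOutputStream(rawOut, cipher)) {
         out.write(contents);
       }
   
       // ...while the encrypted key + metadata is what gets persisted in the manifest.
       return generated.keyMetadata();
     }
   }
   ```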
   
   Now suppose we were to implement all of the above with KMS as the storage backend; I imagine we could provide this as the default implementation that ships with Iceberg. A rough sketch follows the bullets below.
   
   - `KeyManager#resolveEncryptionKey` goes to the KMS and asks it to decrypt the key, using the given metadata's key name and encrypted key bytes. The KMS uses the master key and passes back the decrypted key; `KeyManager` converts that into an `EncryptionKey` object and returns it.
   - `KeyManager#generateEncryptionKey` generates a local key, asks the KMS to encrypt it, gets back the encrypted key, and returns the pair of decrypted and encrypted key in a `GeneratedEncryptionKey` struct.
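   
   Put together, a KMS-backed `KeyManager` might look roughly like this. `KmsClient`, the `EncryptionKeys.of(...)` / `EncryptionMetadatas.of(...)` factories and the `GeneratedEncryptionKey` constructor are all hypothetical stand-ins; this is a sketch of the shape, not a proposed implementation:
   
   ```
   // Rough sketch of a KMS-backed KeyManager.
   import java.security.SecureRandom;
   
   // Hypothetical abstraction over the actual KMS API (AWS KMS, Hadoop KMS, etc.).
   interface KmsClient {
     byte[] encrypt(String keyName, long keyVersion, byte[] plaintextKey);
     byte[] decrypt(String keyName, long keyVersion, byte[] encryptedKey);
   }
   
   class KmsKeyManager implements KeyManager {
     private final KmsClient kms;
     private final String masterKeyName;   // master key used to wrap/unwrap per-file keys
     private final long masterKeyVersion;
     private final SecureRandom random = new SecureRandom();
   
     KmsKeyManager(KmsClient kms, String masterKeyName, long masterKeyVersion) {
       this.kms = kms;
       this.masterKeyName = masterKeyName;
       this.masterKeyVersion = masterKeyVersion;
     }
   
     @Override
     public EncryptionKey resolveEncryptionKey(EncryptionMetadata metadata) {
       // The KMS unwraps the stored key bytes using the master key named in the metadata.
       byte[] keyBytes =
           kms.decrypt(metadata.keyName(), metadata.keyVersion(), metadata.encryptedKeyBytes());
       return EncryptionKeys.of(keyBytes, metadata.iv());
     }
   
     @Override
     public GeneratedEncryptionKey generateEncryptionKey(String filePath) {
       // Generate a fresh local data key and IV for this file...
       byte[] keyBytes = new byte[32];
       byte[] iv = new byte[16];
       random.nextBytes(keyBytes);
       random.nextBytes(iv);
   
       // ...then have the KMS wrap it with the master key; only the wrapped form is persisted.
       byte[] encryptedKeyBytes = kms.encrypt(masterKeyName, masterKeyVersion, keyBytes);
       EncryptionMetadata metadata =
           EncryptionMetadatas.of(masterKeyName, masterKeyVersion, "AES", iv, encryptedKeyBytes);
       return new GeneratedEncryptionKey(EncryptionKeys.of(keyBytes, iv), metadata);
     }
   
     // Batch variants elided; they can simply loop over the single-key calls above.
   }
   ```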
   
   Regarding hadoop-crypto, I was considering it less for the [key 
storage](https://github.com/palantir/hadoop-crypto/blob/ac8d4474ee667121873bf0abf0674d83c78d8b90/crypto-keys/src/main/java/com/palantir/crypto2/keys/KeyStorageStrategy.java)
 layer and more for its [in-memory representation of 
keys](https://github.com/palantir/hadoop-crypto/blob/ac8d4474ee667121873bf0abf0674d83c78d8b90/crypto-keys/src/main/java/com/palantir/crypto2/keys/KeyMaterial.java)
 and its [generation of ciphers from those key 
objects](https://github.com/palantir/hadoop-crypto/blob/6d9e05a1e667f150f7d98435e93a0dd6f3ea5c08/crypto-core/src/main/java/com/palantir/crypto2/io/CryptoStreamFactory.java).
 Though naturally the interfaces we expose shouldn't leak hadoop-crypto types.
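   
   For what it's worth, once a key is resolved, building the cipher stream doesn't strictly need hadoop-crypto at all; the JDK's `javax.crypto` covers it. A minimal read-side sketch, assuming AES/CTR was used on write (the reader class and method names are made up):
   
   ```
   // Minimal read-path sketch using plain javax.crypto; assumes AES/CTR on the write side.
   import java.io.InputStream;
   import javax.crypto.Cipher;
   import javax.crypto.CipherInputStream;
   import javax.crypto.spec.IvParameterSpec;
   import javax.crypto.spec.SecretKeySpec;
   
   class DecryptingReader {
     private final KeyManager keyManager;
   
     DecryptingReader(KeyManager keyManager) {
       this.keyManager = keyManager;
     }
   
     InputStream openDecrypted(InputStream rawIn, EncryptionMetadata metadata) throws Exception {
       // Resolve the plaintext key from the metadata recorded in the manifest...
       EncryptionKey key = keyManager.resolveEncryptionKey(metadata);
   
       // ...and build a decrypting stream from the key bytes and IV, much like
       // hadoop-crypto's CryptoStreamFactory does from a KeyMaterial.
       Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
       cipher.init(
           Cipher.DECRYPT_MODE,
           new SecretKeySpec(key.keyBytes(), "AES"),
           new IvParameterSpec(key.iv()));
       return new CipherInputStream(rawIn, cipher);
     }
   }
   ```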
   
   Finally, to cap it all off, I'd imagine `KeyManager` instances would be provided by implementations of `TableOperations`, and must thus be serializable so they can be sent to e.g. Spark executors.
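   
   In code that could be as small as something like this; `EncryptedTableOperations` below is just a stand-in for adding a `keyManager()` accessor to `TableOperations`, not a proposed interface name:
   
   ```
   import java.io.Serializable;
   
   // KeyManager instances travel with tasks, so they need to be Serializable
   // (or otherwise reconstructible on the executors).
   interface KeyManager extends Serializable {
     // ... methods as sketched above ...
   }
   
   // Hypothetical hook: the table's operations supply the KeyManager for that table.
   interface EncryptedTableOperations {
     KeyManager keyManager();
   }
   ```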
   
   Thoughts on the above proposal?
