mbutrovich commented on code in PR #2584:
URL: https://github.com/apache/iceberg-rust/pull/2584#discussion_r3414594167


##########
crates/iceberg/src/arrow/reader/pipeline.rs:
##########
@@ -431,14 +437,44 @@ impl ArrowReader {
         )
         .with_parquet_read_options(parquet_read_options);
 
-        let arrow_metadata = ArrowReaderMetadata::load_async(&mut reader, 
Default::default())
+        let arrow_reader_options = 
Self::build_arrow_reader_options(key_metadata)?;
+
+        let arrow_metadata = ArrowReaderMetadata::load_async(&mut reader, 
arrow_reader_options)
             .await
             .map_err(|e| {
                 Error::new(ErrorKind::Unexpected, "Failed to load Parquet 
metadata").with_source(e)
             })?;
 
         Ok((reader, arrow_metadata))
     }
+
+    /// Builds `ArrowReaderOptions`, adding `FileDecryptionProperties` when
+    /// key metadata is present for Parquet Modular Encryption.
+    fn build_arrow_reader_options(key_metadata: Option<&[u8]>) -> 
Result<ArrowReaderOptions> {

Review Comment:
   Resolution is hardcoded to `StandardKeyMetadata::decode(km)` -> plaintext 
DEK (line 456-458). Worth noting the spec defines `key_metadata` as 
"implementation-specific key metadata for encryption" (field 131), and 
`StandardKeyMetadata` is the reference implementation's format rather than a 
spec-mandated one. So baking in that single decode path means non-standard 
key-management schemes (for example, where `key_metadata` is a key reference 
resolved from a KMS at read time rather than an embedded DEK) cannot plug in.
   
   Would it be worth keeping the `StandardKeyMetadata` path as the default but 
exposing the resolution step as a trait, so other implementations can supply 
their own? Roughly:
   
   ```rust
   #[async_trait]
   pub trait FileDecryptionHandler: Send + Sync {
       async fn file_decryption_properties(&self, key_metadata: &[u8]) -> 
Result<FileDecryptionProperties>;
   }
   ```
   
   One reason to make it `async`: a scheme that resolves the DEK from a KMS at 
read time (where `key_metadata` carries a wrapped DEK plus a key id, not a 
plaintext DEK) needs a network round-trip during resolution. arrow-rs has a 
lower-level `parquet::encryption::decrypt::KeyRetriever`, but its 
`retrieve_key(&self, key_metadata) -> Result<Vec<u8>>` is synchronous, so it 
cannot perform an async unwrap inline. The general path is therefore to resolve 
the key (async) and hand arrow-rs an explicit-key `FileDecryptionProperties` 
via `builder(dek)` (exactly what this PR already does at line 457), with the 
default `StandardKeyMetadata` impl just being the synchronous no-KMS case.
   
   Iceberg Java structures it this way: `EncryptionManager` is an interface, 
with `StandardEncryptionManager` as the spec-compliant default, selected per 
table and pluggable via the `encryption.kms-impl` catalog property. Worth 
noting the seam there is at the manager / `decrypt` level, not only at the KMS 
client, because an alternative scheme may differ in both the `key_metadata` 
layout and the unwrap step, not just the KMS backend.
   
   Is a hook like this already planned for a later PR, or intentionally 
deferred? Mainly want to avoid the concrete decode path becoming hard to 
retrofit once the rest of the series builds on it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to