ggershinsky commented on code in PR #16527: URL: https://github.com/apache/iceberg/pull/16527#discussion_r3292244790

##########
format/spec.md:
##########
@@ -1213,6 +1066,45 @@ Notes:
 1. The format of encrypted key metadata is determined by the table's encryption scheme and can be a wrapped format specific to the table's KMS provider.
 
+#### Standard Key Metadata
+
+The `key_metadata` field in manifest entries stores per-file encryption key material as a binary blob. To enable cross-implementation interoperability, the standard encryption scheme defines the following binary format for this field:
+
+```
+VersionByte Payload
+```
+
+where:
+
+* `VersionByte` is a single byte indicating the key metadata schema version. Currently, the only valid version is `0x01`.
+* `Payload` is an Avro binary-encoded record (not a container file — only the raw binary encoding of the fields) using the schema for the given version.
+
+The Avro schema for version 1 is a record with the following fields, in order:
+
+| Field name | Avro type | Required | Description |
+|---|---|---|---|
+| **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, or AES-256). |
+| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) integrity protection. |

Review Comment:
   The AAD prefix is used not only in AES GCM Stream files but also in encrypted Parquet files (see https://parquet.apache.org/docs/file-format/data-pages/encryption/ or https://github.com/apache/parquet-format/blob/master/Encryption.md).


##########
format/spec.md:
##########
@@ -1213,6 +1066,45 @@ Notes:
 1. The format of encrypted key metadata is determined by the table's encryption scheme and can be a wrapped format specific to the table's KMS provider.
 
+#### Standard Key Metadata
+
+The `key_metadata` field in manifest entries stores per-file encryption key material as a binary blob. To enable cross-implementation interoperability, the standard encryption scheme defines the following binary format for this field:
+
+```
+VersionByte Payload
+```
+
+where:
+
+* `VersionByte` is a single byte indicating the key metadata schema version. Currently, the only valid version is `0x01`.
+* `Payload` is an Avro binary-encoded record (not a container file — only the raw binary encoding of the fields) using the schema for the given version.
+
+The Avro schema for version 1 is a record with the following fields, in order:
+
+| Field name | Avro type | Required | Description |
+|---|---|---|---|
+| **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, or AES-256). |
+| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) integrity protection. |
+| **`file_length`** | `long` | _optional_ | The plaintext file length before encryption. Used to detect truncation attacks (see [AES GCM Stream file length](gcm-stream-spec.md#file-length)). |

Review Comment:
   This field keeps the file length after encryption; see https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/encryption/AesGcmInputFile.java#L45
   It applies only to AES GCM Stream files and is not set or used for encrypted Parquet data files.


##########
format/spec.md:
##########
@@ -1213,6 +1066,45 @@ Notes:
 1. The format of encrypted key metadata is determined by the table's encryption scheme and can be a wrapped format specific to the table's KMS provider.
 
+#### Standard Key Metadata
+
+The `key_metadata` field in manifest entries stores per-file encryption key material as a binary blob. To enable cross-implementation interoperability, the standard encryption scheme defines the following binary format for this field:
+
+```
+VersionByte Payload
+```
+
+where:
+
+* `VersionByte` is a single byte indicating the key metadata schema version. Currently, the only valid version is `0x01`.
+* `Payload` is an Avro binary-encoded record (not a container file — only the raw binary encoding of the fields) using the schema for the given version.
+
+The Avro schema for version 1 is a record with the following fields, in order:
+
+| Field name | Avro type | Required | Description |
+|---|---|---|---|
+| **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, or AES-256). |
+| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for [AES GCM Stream](gcm-stream-spec.md) integrity protection. |
+| **`file_length`** | `long` | _optional_ | The plaintext file length before encryption. Used to detect truncation attacks (see [AES GCM Stream file length](gcm-stream-spec.md#file-length)). |
+
+The AAD prefix is combined with a 4-byte little-endian block index to form the AAD for each AES GCM Stream cipher block, as described in the [AES GCM Stream AAD section](gcm-stream-spec.md#additional-authenticated-data).

Review Comment:
   In Parquet encryption, this works differently; see https://parquet.apache.org/docs/file-format/data-pages/encryption/


##########
format/spec.md:
##########
@@ -667,7 +664,7 @@ The `data_file` struct consists of the following fields:
 | _optional_ | _optional_ | | ~~**`111 distinct_counts`**~~ | `map<123: int, 124: long>` | **Deprecated. Do not write.** |
 | _optional_ | _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] |
 | _optional_ | _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] |
-| _optional_ | _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption |
+| _optional_ | _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Per-file encryption key metadata. See [Standard Key Metadata](#standard-key-metadata) for the interoperable format used by the standard encryption scheme. |

Review Comment:
   There is also a `key_metadata` field in the Manifest File struct (field id 519).
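
To make the byte layout in the proposed "Standard Key Metadata" text concrete, here is a minimal Python sketch of a version byte followed by a raw Avro binary-encoded record. It is an illustration under stated assumptions, not the Iceberg implementation: it assumes each optional field is written as an Avro union `["null", T]` with the null branch at index 0, and the helper names (`encode_key_metadata`, `decode_key_metadata`) are hypothetical. It takes no position on whether `file_length` means the plaintext or the encrypted length, which is the question raised in the comments above.

```python
import io
from typing import Optional, Tuple

VERSION_V1 = 0x01
VALID_KEY_SIZES = (16, 24, 32)  # AES-128, AES-192, AES-256


def _write_long(out: bytearray, value: int) -> None:
    """Append an Avro long: zigzag-map the value, then emit a base-128 varint."""
    zz = (value << 1) ^ (value >> 63)
    while zz & ~0x7F:
        out.append((zz & 0x7F) | 0x80)
        zz >>= 7
    out.append(zz)


def _read_long(buf: io.BytesIO) -> int:
    """Read a base-128 varint and undo the zigzag mapping."""
    acc, shift = 0, 0
    while True:
        raw = buf.read(1)
        if not raw:
            raise ValueError("truncated varint")
        acc |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return (acc >> 1) ^ -(acc & 1)
        shift += 7


def _write_bytes(out: bytearray, data: bytes) -> None:
    """Avro bytes: a long length followed by the raw bytes."""
    _write_long(out, len(data))
    out.extend(data)


def _read_bytes(buf: io.BytesIO) -> bytes:
    n = _read_long(buf)
    if n < 0:
        raise ValueError("negative length")
    data = buf.read(n)
    if len(data) != n:
        raise ValueError("truncated bytes field")
    return data


def encode_key_metadata(encryption_key: bytes,
                        aad_prefix: Optional[bytes] = None,
                        file_length: Optional[int] = None) -> bytes:
    if len(encryption_key) not in VALID_KEY_SIZES:
        raise ValueError("encryption_key must be 16, 24, or 32 bytes")
    out = bytearray([VERSION_V1])
    _write_bytes(out, encryption_key)
    # ASSUMPTION: optional fields are unions ["null", T]; index 0 means null.
    if aad_prefix is None:
        _write_long(out, 0)
    else:
        _write_long(out, 1)
        _write_bytes(out, aad_prefix)
    if file_length is None:
        _write_long(out, 0)
    else:
        _write_long(out, 1)
        _write_long(out, file_length)
    return bytes(out)


def decode_key_metadata(blob: bytes) -> Tuple[bytes, Optional[bytes], Optional[int]]:
    if not blob or blob[0] != VERSION_V1:
        raise ValueError("unsupported key metadata version")
    buf = io.BytesIO(blob[1:])
    key = _read_bytes(buf)
    aad = _read_bytes(buf) if _read_long(buf) == 1 else None
    length = _read_long(buf) if _read_long(buf) == 1 else None
    return key, aad, length
```

A round trip such as `decode_key_metadata(encode_key_metadata(key, aad, 1024))` returns the original triple. Hand-rolling the varint and union handling keeps the sketch free of an Avro library dependency; a real reader would more likely use an Avro decoder with the version 1 schema.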
