xanderbailey commented on code in PR #16527: URL: https://github.com/apache/iceberg/pull/16527#discussion_r3304316938
##########
format/spec.md:
##########
@@ -1299,6 +1066,49 @@ Notes:
 1. The format of encrypted key metadata is determined by the table's encryption scheme and can be a wrapped format specific to the table's KMS provider.
+#### Standard Key Metadata
+
+The `key_metadata` field in manifest entries stores per-file encryption key material as a binary blob. To enable cross-implementation interoperability, the standard encryption scheme defines the following binary format for this field:
+
+```
+VersionByte Payload
+```
+
+where:
+
+* `VersionByte` is a single byte indicating the key metadata schema version. Currently, the only valid version is `0x01`.
+* `Payload` is an Avro binary-encoded record (not a container file — only the raw binary encoding of the fields) using the schema for the given version.
+
+The Avro schema for version 1 is a record with the following fields, in order:
+
+| Field name | Avro type | Required | Description |
+|---|---|---|---|
+| **`encryption_key`** | `bytes` | _required_ | The data encryption key (DEK) for this file. Must be 16, 24, or 32 bytes (corresponding to AES-128, AES-192, or AES-256). |
+| **`aad_prefix`** | `bytes` | _optional_ | Random AAD prefix used for encryption integrity protection. For [AES GCM Stream](gcm-stream-spec.md) files, the prefix is combined with a block index to form the per-block AAD. For [Parquet modular encryption](https://parquet.apache.org/docs/file-format/data-pages/encryption/), the prefix is passed as the `aad_file_unique` component. |
+| **`file_length`** | `long` | _optional_ | The encrypted file length in bytes. Required for [AES GCM Stream](gcm-stream-spec.md) encrypted files to detect truncation attacks (see [AES GCM Stream file length](gcm-stream-spec.md#file-length)). Not set for Parquet encrypted files. |
+
+The usage of the `encryption_key` and `aad_prefix` fields depends on the file format:
+
+* **AES GCM Stream files** (manifest lists, manifests, and non-Parquet data files): The `encryption_key` is used directly as the AES-GCM key. The `aad_prefix` is combined with a 4-byte little-endian block index to form the AAD for each cipher block, as described in the [AES GCM Stream AAD section](gcm-stream-spec.md#additional-authenticated-data). The `file_length` field stores the encrypted file length for truncation detection.


Review Comment:
Made it a list here [bcaf862](https://github.com/apache/iceberg/pull/16527/commits/bcaf8624f8bd90b7171f21fe5eafbb2f9d95c619)
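
For readers following the thread, the `VersionByte Payload` layout quoted in the diff can be sketched in Python without an Avro library. This is an illustrative sketch, not the PR's implementation: the names `KeyMetadata`, `encode_key_metadata` and `decode_key_metadata` are hypothetical, and it assumes the two optional fields are written as Avro unions `["null", T]` with `null` as branch 0, which the quoted text does not state explicitly. The `long` and `bytes` encodings follow the Avro binary encoding (zigzag varint, length-prefixed bytes).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

VERSION_1 = 0x01  # the only valid VersionByte in the quoted spec text


@dataclass
class KeyMetadata:  # hypothetical container, named for illustration only
    encryption_key: bytes
    aad_prefix: Optional[bytes] = None
    file_length: Optional[int] = None


def _write_long(n: int) -> bytes:
    # Avro long: zigzag-encode, then emit 7-bit groups, low group first.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def _read_long(buf: bytes, pos: int) -> Tuple[int, int]:
    z, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos  # undo zigzag


def _write_bytes(b: bytes) -> bytes:
    # Avro bytes: length as a long, followed by the raw bytes.
    return _write_long(len(b)) + b


def _read_bytes(buf: bytes, pos: int) -> Tuple[bytes, int]:
    n, pos = _read_long(buf, pos)
    return buf[pos:pos + n], pos + n


def encode_key_metadata(key: bytes, aad_prefix: Optional[bytes] = None,
                        file_length: Optional[int] = None) -> bytes:
    if len(key) not in (16, 24, 32):
        raise ValueError("encryption_key must be 16, 24, or 32 bytes")
    out = bytearray([VERSION_1])
    out += _write_bytes(key)
    # ASSUMPTION: optional fields are unions with null as branch 0.
    for value, writer in ((aad_prefix, _write_bytes), (file_length, _write_long)):
        out += _write_long(0) if value is None else _write_long(1) + writer(value)
    return bytes(out)


def decode_key_metadata(blob: bytes) -> KeyMetadata:
    if not blob or blob[0] != VERSION_1:
        raise ValueError("unsupported key metadata version")
    key, pos = _read_bytes(blob, 1)
    branch, pos = _read_long(blob, pos)
    aad = None
    if branch == 1:
        aad, pos = _read_bytes(blob, pos)
    branch, pos = _read_long(blob, pos)
    length = None
    if branch == 1:
        length, pos = _read_long(blob, pos)
    return KeyMetadata(key, aad, length)
```

Under these assumptions, a 16-byte key with neither optional field set encodes to 20 bytes: one version byte, one length byte, the key, and one null branch index per optional field. A Parquet file would leave `file_length` unset, while an AES GCM Stream file would set it.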
