Hi all,

While implementing table encryption in iceberg-rust, we found a couple
of undocumented formats that are required for interoperability but are
described in the spec only as "implementation-specific." We
have reverse-engineered these from Java's implementation to achieve
byte-compatibility. Any future implementation (PyIceberg, etc.) would need
to do the same.

I'd like to propose that we specify the following in the spec, likely as a
new appendix or an expansion of the encryption section.

1. StandardKeyMetadata — the file-level key metadata format

The `key_metadata` binary field (field 131 in data files, field 519 in
manifest lists) uses a versioned Avro encoding in Java's
`StandardKeyMetadata`:

Wire format: `[version: 1 byte (0x01)] [Avro binary datum]`

V1 schema:
```
required(0, "encryption_key", binary) -- plaintext DEK
optional(1, "aad_prefix", binary) -- per-file AAD prefix for AES-GCM
optional(2, "file_length", long) -- encrypted file size (for streaming decryption)
```
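
To make the layout concrete, here is a rough Python sketch of how I read the V1 wire format. It hand-rolls the Avro binary encoding (zigzag varint longs, length-prefixed bytes) so it has no dependencies. Two things here are assumptions on my part rather than anything the spec states: that each optional field is written as an Avro union `["null", T]` with null as branch 0, and that fields are written in the order listed above. The function names are hypothetical and do not come from Java or iceberg-rust.

```python
V1 = 0x01  # version byte that prefixes the Avro datum


def _write_long(out: bytearray, n: int) -> None:
    """Append an Avro long: zigzag-encode, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)


def _read_long(buf: bytes, pos: int):
    """Read an Avro long starting at pos; return (value, new_pos)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_key_metadata(dek: bytes, aad_prefix=None, file_length=None) -> bytes:
    out = bytearray([V1])
    # required encryption_key: length-prefixed bytes
    _write_long(out, len(dek))
    out += dek
    # optional aad_prefix: union branch index (assumed 0 = null, 1 = value)
    if aad_prefix is None:
        _write_long(out, 0)
    else:
        _write_long(out, 1)
        _write_long(out, len(aad_prefix))
        out += aad_prefix
    # optional file_length: same assumed union layout, then a long
    if file_length is None:
        _write_long(out, 0)
    else:
        _write_long(out, 1)
        _write_long(out, file_length)
    return bytes(out)


def decode_key_metadata(buf: bytes):
    if not buf or buf[0] != V1:
        raise ValueError("unsupported key metadata version")
    pos = 1
    n, pos = _read_long(buf, pos)
    dek = bytes(buf[pos:pos + n])
    pos += n
    aad_prefix = None
    branch, pos = _read_long(buf, pos)
    if branch == 1:
        n, pos = _read_long(buf, pos)
        aad_prefix = bytes(buf[pos:pos + n])
        pos += n
    file_length = None
    branch, pos = _read_long(buf, pos)
    if branch == 1:
        file_length, pos = _read_long(buf, pos)
    return dek, aad_prefix, file_length
```

A spec appendix would need to pin down exactly these details (union branch order, field order, version byte semantics), since a reader cannot recover them from the field names alone.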

2. The encryption-keys list — KEKs vs wrapped DEKs

The table-level `encryption-keys` list stores two kinds of entries,
distinguished by what `encrypted-by-id` points to:

**KEK entries** (`encrypted-by-id` = table master key ID):
- `encrypted-key-metadata`: the KEK wrapped by the KMS (opaque,
KMS-specific format)
- `properties`: includes `"key-timestamp"` (epoch millis) for expiration

**Wrapped manifest-list DEK entries** (`encrypted-by-id` = a KEK's key-id):
- `encrypted-key-metadata`: the `StandardKeyMetadata` Avro bytes (from #1
above) encrypted with AES-GCM using the referenced KEK, with the KEK's
timestamp string as AAD
- `properties`: empty

The convention for distinguishing these two types of entries, and the
wrapping scheme (AES-GCM with the KEK timestamp as AAD to prevent
tampering), are not documented anywhere in the spec from what I can see.
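
As an illustration of the convention, a reader has to do something like the following before it can unwrap anything. This is only a sketch in Python over plain dicts: the function name and the dict-based representation are hypothetical, and I have deliberately left out the AES-GCM unwrap itself because the exact ciphertext layout is one of the things I'd want the spec to state.

```python
def classify_encryption_keys(entries, table_key_id):
    """Split the encryption-keys list into KEKs and wrapped DEKs.

    Returns (keks, wrapped), where keks maps key-id to the KEK entry and
    wrapped is a list of (dek_entry, kek_timestamp) pairs. The timestamp
    string is what gets used as AAD when unwrapping the DEK entry.
    """
    # KEK entries point at the table master key.
    keks = {
        e["key-id"]: e
        for e in entries
        if e.get("encrypted-by-id") == table_key_id
    }
    # Wrapped DEK entries point at one of those KEKs.
    wrapped = []
    for e in entries:
        kek = keks.get(e.get("encrypted-by-id"))
        if kek is not None:
            wrapped.append((e, kek["properties"]["key-timestamp"]))
    return keks, wrapped
```

Entries that match neither rule are silently skipped here; whether a reader should reject them instead is another question the spec could answer.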

3. What can stay "implementation-specific"

The KEK's `encrypted-key-metadata` is intentionally opaque: it's whatever
the KMS returns from `wrapKey`. That's fine to leave unspecified since it's
between the implementation and its KMS provider.

### Why this matters

Without specifying #1 and #2, "implementation-specific" becomes a practical
interop barrier: tables encrypted by one implementation would be unreadable
by another despite both being spec-compliant. These formats are already
versioned and frozen in Java, so the spec would just be documenting existing
reality.

Would there be interest in a PR for this? Happy to draft it.

Thanks,
Xander
