rdblue commented on code in PR #5432:
URL: https://github.com/apache/iceberg/pull/5432#discussion_r950916081


##########
format/gcm-stream-spec.md:
##########
@@ -0,0 +1,87 @@
+---
+title: "AES GCM Stream Spec"
+url: gcm-stream-spec
+toc: true
+disableSidebar: true
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements.  See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License.  You may obtain a copy of the License at
+ -
+ -   http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+# AES GCM Stream (AGS) file format extension
+
+## Background and Motivation
+
+Iceberg supports a number of data file formats. Two of these formats (Parquet and ORC) have built-in encryption capabilities that allow protecting sensitive information in the data files. However, besides the data files, Iceberg tables also have metadata files that can contain sensitive information (e.g., min/max values in manifest files, or bloom filter bitsets in Puffin files). The metadata file formats (Avro, JSON, Puffin) don't have encryption support.
+
+Moreover, with the exception of Parquet, no Iceberg data or metadata file format supports integrity verification, which is required for end-to-end tamper proofing of Iceberg tables.
+
+This document specifies details of a simple file format extension that adds encryption and tamper-proofing to any existing file format.
+
+## Goals
+
+* Metadata encryption: enable encryption of manifests, manifest lists, snapshots and stats.
+* Avro data encryption: enable encryption of data files in tables that use the Avro format.
+* Tamper proofing of Iceberg data and metadata files.
+
+## Overview
+
+The output stream, produced by a metadata or data writer, is split into equal-size blocks (plus a residual last block). Each block is enciphered (encrypted/signed) with a given encryption key and stored in a file in the AGS format. Upon reading, the stored cipher blocks are verified for integrity, then decrypted and passed to the metadata or data readers.
+
+## Encryption algorithm
+
+AGS uses the standard AES GCM cipher and supports all AES key sizes: 128, 192 and 256 bits.
+
+AES GCM is an authenticated encryption cipher. Besides data confidentiality (encryption), it supports two levels of integrity verification (authentication): of the data (default), and of the data combined with an optional AAD (“additional authenticated data”). An AAD is free text to be authenticated together with the data. The structure of AGS AADs is described below.
+
+AES GCM requires a unique vector to be provided for each encrypted block. In this document, the unique input to GCM encryption is called a nonce (“number used once”). AGS encryption uses the RBG-based (random bit generator) nonce construction defined in section 8.2.2 of the NIST SP 800-38D document. For each encrypted block, AGS generates a unique nonce with a length of 12 bytes (96 bits).
+
+## Format specification
+
+### File structure
+
+The AGS-encrypted files have the following structure
+
+```
+Magic BlockLength CipherBlock₁ CipherBlock₂ ... CipherBlockₙ
+```
+
+where
+
+- `Magic` is four bytes: 0x41, 0x47, 0x53, 0x31 ("AGS1", short for AES GCM Stream, version 1)
+- `BlockLength` is a four-byte (little endian) integer keeping the length of the equal-size split blocks before encryption. The length is specified in bytes.
+- `CipherBlockᵢ` is the i-th enciphered block in the file, with the structure defined below.
+
+### Cipher Block structure
+
+Cipher blocks have the following structure
+
+| nonce | ciphertext | tag |
+|-------|------------|-----|

Review Comment:
   To be able to split formats like Avro that are splittable, we need to be able to go back and forth between offsets in the underlying data stream and offsets in the AGS stream. I think that's possible since this has a strict format. The header structure is 8 bytes (4 magic + 4 block length) and then there are 24 bytes of overhead per CipherBlock (12 nonce + 12 tag).
   
   For any given offset in the underlying stream, we can find the block that contains it by dividing by the block length:
   
   ```python
   def plain_block_index(plain_offset, block_length):
       return plain_offset // block_length
   ```
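
   For example, with `block_length = 1024`, plaintext offset 2048 gives `plain_block_index(2048, 1024) == 2`, i.e. the third block (ciphertext-3 in the table below).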
   
   Then we can skip to a given block:
   
   ```python
   def cipher_block_offset(block_index, block_length):
       return 8 + block_index * (24 + block_length)
   ```
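
   For example, `cipher_block_offset(2, 1024)` is `8 + 2 * 1048 = 2104`, which lines up with the start of nonce-3 in the table below.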
   
   But the problem is that we don't get split offsets in the underlying file; we get split offsets in the AGS stream and need to figure out where those translate to in the underlying file, so that tasks aren't responsible for overlapping ranges. We have a picture like this for a block size of 1024, 3 blocks, and a last block of size 553:
   
   |  | magic | block length | nonce-1 | ciphertext-1 | tag-1 | nonce-2 | ciphertext-2 | tag-2 | nonce-3 | ciphertext-3 | tag-3 | EOF |
   | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
   | AGS offset | 0 | 4 | 8 | 20 (to 1043) | 1044 | 1056 | 1068 (to 2091) | 2092 | 2104 | 2116 (to 2668) | 2669 | 2681 |
   | Stream offset |  |  |  | 0 (to 1023) |  |  | 1024 (to 2047) |  |  | 2048 (to 2600) |  | 2601 |
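
   As a quick sanity check on the offsets above, here is a small sketch (not part of the proposal) that recomputes them for the same example, assuming the 12-byte nonce and 12-byte tag overhead used in this comment:

   ```python
   def ags_offsets(plain_block_sizes, nonce_len=12, tag_len=12):
       # Walk the AGS layout: 8-byte header, then nonce | ciphertext | tag per block.
       offset = 8  # 4 bytes of magic + 4 bytes of block length
       rows = []
       for size in plain_block_sizes:
           nonce = offset
           ciphertext = nonce + nonce_len
           tag = ciphertext + size
           rows.append((nonce, ciphertext, tag))
           offset = tag + tag_len
       return rows, offset  # per-block (nonce, ciphertext, tag) starts, plus the EOF offset

   print(ags_offsets([1024, 1024, 553]))
   # ([(8, 20, 1044), (1056, 1068, 2092), (2104, 2116, 2669)], 2681)
   ```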
   
   To convert an AGS offset into a stream offset, I think we need to produce the block index and an offset in that block. Producing the block index is the opposite of the `cipher_block_offset` function:
   
   ```python
   def cipher_block_index(cipher_offset, block_length):
       if cipher_offset < 8:
           raise ValueError()
       return (cipher_offset - 8) // (block_length + 24)
   ```
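
   For example, AGS offset 1044 (the first byte of tag-1) gives `cipher_block_index(1044, 1024) == (1044 - 8) // 1048 == 0`, so it still falls in the first block.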
   
   Then the offset in the plaintext block can be produced by subtracting the start of the cipher block from the ciphertext offset, with some extra accounting:
   
   ```python
   def plaintext_block_offset(cipher_offset, block_length):
       block_index = cipher_block_index(cipher_offset, block_length)
       block_start = cipher_block_offset(block_index, block_length)
       local_offset = cipher_offset - block_start
       return local_offset if local_offset < block_length else block_length
   ```
   
   The check against `block_length` ensures that the local block offsets never 
overlap. Offsets that are between the ciphertext blocks (like 1048 in tag-1 for 
the example above) are mapped to the end of the plaintext block. This gives us 
a stable way to recover plaintext offsets.
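
   Putting the two pieces together, the full plaintext offset is the block index times the block length plus the clamped in-block offset. A minimal sketch (the `plaintext_offset` name is just for illustration, building on the functions above):

   ```python
   def plaintext_offset(cipher_offset, block_length):
       # Monotone mapping from an AGS offset to a plaintext offset,
       # so ranges derived from it never overlap.
       block_index = cipher_block_index(cipher_offset, block_length)
       return block_index * block_length + plaintext_block_offset(cipher_offset, block_length)

   # AGS offset 1048 is inside tag-1, so it is clamped to the end of the first
   # plaintext block: plaintext_offset(1048, 1024) == 1024
   ```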
   
   Using the plaintext offsets, I think we can reliably split AGS files that store Avro data:
   1. Convert AGS byte range splits into plaintext splits, removing any 0-byte splits (unlikely but possible); a rough sketch follows this list
   2. Implement an AGS stream that can seek to plaintext offsets (using the translation methods above)
   3. Pass the plaintext splits into the Avro reader with the plaintext-seeking AGS stream
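
   For step 1, the conversion could look roughly like this (a sketch only; `ags_splits` is assumed to be a list of `(offset, length)` ranges in the AGS stream, and `plaintext_offset` is the helper sketched above):

   ```python
   def to_plaintext_splits(ags_splits, block_length):
       # Map each AGS byte range to a plaintext byte range, clamping offsets that
       # fall inside the 8-byte header to the first block and dropping 0-byte results.
       plain_splits = []
       for ags_start, ags_length in ags_splits:
           start = plaintext_offset(max(ags_start, 8), block_length)
           end = plaintext_offset(max(ags_start + ags_length, 8), block_length)
           if end > start:
               plain_splits.append((start, end - start))
       return plain_splits
   ```

   (For the last split, the computed end can overshoot the plaintext EOF by up to the 24 bytes of per-block overhead; that should be harmless for split reading since the reader stops at the actual end of the stream anyway.)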
   
   @ggershinsky, what do you think?


