This is an automated email from the ASF dual-hosted git repository.
aokolnychyi pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/main by this push:
new b9ebc71fbc Puffin: Add deletion-vector-v1 blob type (#11238)
b9ebc71fbc is described below
commit b9ebc71fbc9803b6a8a4b9ed63b9ad4adeb66edf
Author: Ryan Blue <[email protected]>
AuthorDate: Sat Nov 2 03:04:03 2024 -0700
Puffin: Add deletion-vector-v1 blob type (#11238)
---
format/puffin-spec.md | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)
diff --git a/format/puffin-spec.md b/format/puffin-spec.md
index 7b4e3e6d96..0148db72e2 100644
--- a/format/puffin-spec.md
+++ b/format/puffin-spec.md
@@ -125,6 +125,61 @@ The blob metadata for this blob may include following properties:
stored as non-negative integer value represented using decimal digits
with no leading or trailing spaces.
+#### `deletion-vector-v1` blob type
+
+A serialized delete vector (bitmap) that represents the positions of rows in a
+file that are deleted. A set bit at position P indicates that the row at
+position P is deleted.
+
+The vector supports non-negative 64-bit positions (the most significant bit must be
+0), but is optimized for cases where most positions fit in 32 bits by using a
+collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a
+32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using
+the least significant 4 bytes. For each key in the set of positions, a 32-bit
+Roaring bitmap is maintained to store a set of 32-bit sub-positions for that
+key.
+
+To test whether a certain position is set, its most significant 4 bytes (the
+key) are used to find a 32-bit bitmap and the least significant 4 bytes (the
+sub-position) are tested for inclusion in the bitmap. If a bitmap is not found
+for the key, then the position is not set.
+
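A minimal sketch of the key/sub-position lookup described above. It is illustrative only: the names `split_position` and `ToyDeletionVector` are hypothetical, and a plain Python `set` stands in for each 32-bit Roaring bitmap.

```python
from typing import Dict, Set, Tuple

# The most significant bit of a 64-bit position must be 0.
MAX_POSITION = (1 << 63) - 1


def split_position(pos: int) -> Tuple[int, int]:
    """Split a 64-bit position into (key, sub_position)."""
    if not 0 <= pos <= MAX_POSITION:
        raise ValueError(f"position out of range: {pos}")
    key = pos >> 32                   # most significant 4 bytes
    sub_position = pos & 0xFFFFFFFF   # least significant 4 bytes
    return key, sub_position


class ToyDeletionVector:
    """Hypothetical stand-in: one Python set per key instead of a Roaring bitmap."""

    def __init__(self) -> None:
        self._bitmaps: Dict[int, Set[int]] = {}

    def delete(self, pos: int) -> None:
        key, sub = split_position(pos)
        self._bitmaps.setdefault(key, set()).add(sub)

    def is_deleted(self, pos: int) -> bool:
        key, sub = split_position(pos)
        bitmap = self._bitmaps.get(key)
        # No bitmap for the key means the position is not set.
        return bitmap is not None and sub in bitmap
```

Positions below 2^32 all share key 0, so a typical data file needs only a single bitmap.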
+The serialized blob contains:
+* Combined length of the vector and magic bytes stored as 4 bytes, big-endian
+* A 4-byte magic sequence, `D1 D3 39 64`
+* The vector, serialized as described below
+* A CRC-32 checksum of the magic bytes and serialized vector as 4 bytes, big-endian
+
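The blob framing listed above can be sketched as follows. This is an illustration under two assumptions: that the checksum is the common CRC-32 computed by Python's `zlib.crc32`, and that the vector bytes have already been produced elsewhere. The helper names `frame_blob` and `unframe_blob` are hypothetical.

```python
import struct
import zlib

MAGIC = bytes([0xD1, 0xD3, 0x39, 0x64])


def frame_blob(vector: bytes) -> bytes:
    """Wrap an already-serialized vector with length, magic bytes, and CRC."""
    body = MAGIC + vector
    length = struct.pack(">I", len(body))      # magic + vector, big-endian
    crc = struct.pack(">I", zlib.crc32(body))  # covers magic + vector, big-endian
    return length + body + crc


def unframe_blob(blob: bytes) -> bytes:
    """Validate the framing and return the serialized vector bytes."""
    (length,) = struct.unpack_from(">I", blob, 0)
    if len(blob) != 4 + length + 4:
        raise ValueError("length field does not match blob size")
    body = blob[4:4 + length]
    if body[:4] != MAGIC:
        raise ValueError("bad magic bytes")
    (crc,) = struct.unpack_from(">I", blob, 4 + length)
    if crc != zlib.crc32(body):
        raise ValueError("CRC mismatch")
    return body[4:]
```

Note that the length field counts the magic bytes and the vector, but not itself or the trailing checksum.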
+The position vector is serialized using the Roaring bitmap
+["portable" format][roaring-bitmap-portable-serialization]. This representation
+consists of:
+
+* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian
+* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys:
+ - The key stored as 4 bytes, little-endian
+ - A [32-bit Roaring bitmap][roaring-bitmap-general-layout]
+
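A rough sketch of the outer layout just listed. It assumes each per-key payload is already a serialized 32-bit Roaring bitmap (for example, produced by a Roaring library) and treats it as opaque bytes; `serialize_vector` is a hypothetical helper name.

```python
import struct
from typing import Mapping


def serialize_vector(bitmaps: Mapping[int, bytes]) -> bytes:
    """Concatenate the bitmap count, then (key, bitmap) pairs in key order."""
    out = bytearray(struct.pack("<Q", len(bitmaps)))  # 8 bytes, little-endian
    # Keys are non-negative Python ints, so sorted() gives unsigned order.
    for key in sorted(bitmaps):
        if not 0 <= key <= 0xFFFFFFFF:
            raise ValueError(f"key does not fit in 32 bits: {key}")
        out += struct.pack("<I", key)                 # 4 bytes, little-endian
        out += bitmaps[key]                           # opaque 32-bit Roaring bitmap
    return bytes(out)
```

In languages with signed 32-bit integers, the keys need an explicit unsigned comparison to get the same ordering.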
+Note that the length and CRC fields are stored using big-endian, but the
+Roaring bitmap format uses little-endian values. Big-endian values were chosen
+for compatibility with existing deletion vectors in Delta tables.
+
+The blob's `properties` must:
+
+* Include `referenced-data-file`, the location of the data file the delete
+ vector applies to; must be equal to the data file's `location` in table
+ metadata
+* Include `cardinality`, the number of deleted rows (set positions) in the
+ delete vector
+* Omit `compression-codec`; `deletion-vector-v1` is not compressed
+
+Snapshot ID and sequence number are not known at the time the Puffin file is
+created. `snapshot-id` and `sequence-number` must be set to -1 in blob metadata
+for Puffin v1.
+
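The metadata rules above could be checked along these lines. This is a sketch only: it assumes blob metadata is available as a parsed mapping with a `properties` entry, the function name `check_dv_blob_metadata` is hypothetical, and it rejects `compression-codec` wherever it appears because the text above does not say at which level a reader would see it.

```python
from typing import Any, List, Mapping


def check_dv_blob_metadata(meta: Mapping[str, Any]) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems: List[str] = []
    props = meta.get("properties", {})

    if not props.get("referenced-data-file"):
        problems.append("properties must include referenced-data-file")
    if "cardinality" not in props:
        problems.append("properties must include cardinality")
    # Assumption: look for the codec both at the top level and in properties.
    if "compression-codec" in meta or "compression-codec" in props:
        problems.append("compression-codec must be omitted")
    for field in ("snapshot-id", "sequence-number"):
        if meta.get(field) != -1:
            problems.append(f"{field} must be -1")
    return problems
```

For example, metadata with `snapshot-id` set to a real snapshot ID would be reported, since the ID is not known when the Puffin file is written.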
+
+[roaring-bitmap-portable-serialization]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#extension-for-64-bit-implementations
+[roaring-bitmap-general-layout]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#general-layout
+
### Compression codecs
The data can also be uncompressed. If it is compressed the codec should be one of