This is an automated email from the ASF dual-hosted git repository.

aokolnychyi pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git


The following commit(s) were added to refs/heads/main by this push:
     new b9ebc71fbc Puffin: Add deletion-vector-v1 blob type (#11238)
b9ebc71fbc is described below

commit b9ebc71fbc9803b6a8a4b9ed63b9ad4adeb66edf
Author: Ryan Blue <[email protected]>
AuthorDate: Sat Nov 2 03:04:03 2024 -0700

    Puffin: Add deletion-vector-v1 blob type (#11238)
---
 format/puffin-spec.md | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/format/puffin-spec.md b/format/puffin-spec.md
index 7b4e3e6d96..0148db72e2 100644
--- a/format/puffin-spec.md
+++ b/format/puffin-spec.md
@@ -125,6 +125,61 @@ The blob metadata for this blob may include following 
properties:
   stored as non-negative integer value represented using decimal digits
   with no leading or trailing spaces.
 
+#### `deletion-vector-v1` blob type
+
+A serialized delete vector (bitmap) that represents the positions of rows in a
+file that are deleted. A set bit at position P indicates that the row at
+position P is deleted.
+
+The vector supports positive 64-bit positions (the most significant bit must be
+0), but is optimized for cases where most positions fit in 32 bits by using a
+collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a
+32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using
+the least significant 4 bytes. For each key in the set of positions, a 32-bit
+Roaring bitmap is maintained to store a set of 32-bit sub-positions for that
+key.
+
+To test whether a certain position is set, its most significant 4 bytes (the
+key) are used to find a 32-bit bitmap and the least significant 4 bytes (the
+sub-position) are tested for inclusion in the bitmap. If a bitmap is not found
+for the key, then it is not set.
+
+The serialized blob contains:
+* Combined length of the vector and magic bytes stored as 4 bytes, big-endian
+* A 4-byte magic sequence, `D1 D3 39 64`
+* The vector, serialized as described below
+* A CRC-32 checksum of the magic bytes and serialized vector as 4 bytes, 
big-endian
+
+The position vector is serialized using the Roaring bitmap
+["portable" format][roaring-bitmap-portable-serialization]. This representation
+consists of:
+
+* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian
+* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit 
keys:
+    - The key stored as 4 bytes, little-endian
+    - A [32-bit Roaring bitmap][roaring-bitmap-general-layout]
+
+Note that the length and CRC fields are stored using big-endian, but the
+Roaring bitmap format uses little-endian values. Big endian values were chosen
+for compatibility with existing deletion vectors in Delta tables.
+
+The blob's `properties` must:
+
+* Include `referenced-data-file`, the location of the data file the delete
+  vector applies to; must be equal to the data file's `location` in table
+  metadata
+* Include `cardinality`, the number of deleted rows (set positions) in the
+  delete vector
+* Omit `compression-codec`; `deletion-vector-v1` is not compressed
+
+Snapshot ID and sequence number are not known at the time the Puffin file is
+created. `snapshot-id` and `sequence-number` must be set to -1 in blob metadata
+for Puffin v1.
+
+
+[roaring-bitmap-portable-serialization]: 
https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#extension-for-64-bit-implementations
+[roaring-bitmap-general-layout]: 
https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#general-layout
+
 ### Compression codecs
 
 The data can also be uncompressed. If it is compressed the codec should be one 
of

Reply via email to