ebyhr commented on code in PR #2822:
URL: https://github.com/apache/iceberg-python/pull/2822#discussion_r2604941480
##########
pyiceberg/table/puffin.py:
##########
@@ -114,3 +147,97 @@ def __init__(self, puffin: bytes) -> None:
def to_vector(self) -> dict[str, "pa.ChunkedArray"]:
return {path: _bitmaps_to_chunked_array(bitmaps) for path, bitmaps in
self._deletion_vectors.items()}
+
+
+class PuffinWriter:
Review Comment:
Iceberg spec recommends writing `created-by` property in Puffin files:
https://iceberg.apache.org/puffin-spec/#deletion-vector-v1-blob-type. Could you
please write the property?
##########
tests/table/test_puffin.py:
##########
@@ -71,3 +71,78 @@ def test_map_high_vals() -> None:
with pytest.raises(ValueError, match="Key 4022190063 is too large, max
2147483647 to maintain compatibility with Java impl"):
_ = _deserialize_bitmap(puffin)
+
+
+def test_puffin_round_trip() -> None:
+ # Define some deletion positions for multiple files
+ deletions1 = [10, 20, 30]
Review Comment:
Iceberg deletion vectors expect 1:1 for DV:path. The current PuffinWriter
may violate the expectation.
For instance, the following code writes 2 blobs without merging them. Do you
fix this behavior as a follow-up, or the caller should provide the merged DV?
```py
deletions1a = [10, 20]
deletions1b = [30]
...
writer.add(positions=deletions1a, referenced_data_file=file1_path)
writer.add(positions=deletions1b, referenced_data_file=file1_path)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]