[ https://issues.apache.org/jira/browse/IMPALA-14755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18068788#comment-18068788 ]
ASF subversion and git services commented on IMPALA-14755:
----------------------------------------------------------
Commit 46b55b30d993e7c056140bc07c17b2203a963382 in impala's branch
refs/heads/master from Peter Rozsa
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=46b55b30d ]
IMPALA-14755:(part 1) Implement Puffin Blob reader and File writer
This is the first part of a multi-part implementation adding support for
Iceberg deletion vectors stored in Puffin files. This commit introduces
the core infrastructure for reading and writing Puffin format files
containing deletion vector blobs.
This commit adds:
- Generic BlobReader template base class for reading blob data from HDFS,
  with a specialized DeletionVectorBlobReader for Puffin deletion vectors
- PuffinWriter that writes Puffin files with deletion vector blobs,
  supporting merging of existing and new deletion vectors
- Puffin data structures (BlobMetadata, BlobData, File) and serialization
- Integration with table sink pipeline via new PUFFIN THdfsFileFormat
- Extended OutputPartition with PuffinWriteResult for tracking DV metadata
- CRC32 checksums for blob integrity and RoaringBitmap64::Or() for DV merging
- Updated Thrift/FlatBuffer schemas for deletion vector metadata
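
The merge and checksum steps listed above can be illustrated with a small
Python sketch. It is not the code from this commit: merge_positions,
encode_positions, frame_blob and read_blob are hypothetical names, a Python
set stands in for the 64-bit Roaring bitmap (set union plays the role of
RoaringBitmap64::Or()), and the payload encoding is a placeholder rather
than the Roaring serialization. The framing (big-endian length, 4-byte
magic, payload, big-endian CRC-32 over magic plus payload) is an assumption
loosely modeled on the Iceberg deletion-vector blob layout.

```python
import struct
import zlib

# Assumed magic bytes for the sketch; treat as illustrative, not authoritative.
DV_MAGIC = bytes([0xD1, 0xD3, 0x39, 0x64])


def merge_positions(existing, new):
    """Union two collections of deleted row positions (stand-in for a bitmap OR)."""
    return set(existing) | set(new)


def encode_positions(positions):
    """Placeholder payload: sorted positions as 8-byte little-endian integers."""
    return b"".join(struct.pack("<Q", p) for p in sorted(positions))


def frame_blob(payload):
    """Wrap a payload as: length | magic | payload | CRC-32(magic + payload)."""
    body = DV_MAGIC + payload
    crc = zlib.crc32(body)
    return struct.pack(">I", len(body)) + body + struct.pack(">I", crc)


def read_blob(blob):
    """Validate length, checksum and magic, then return the payload."""
    if len(blob) < 8:
        raise ValueError("blob too short")
    (length,) = struct.unpack_from(">I", blob, 0)
    if len(blob) != 4 + length + 4:
        raise ValueError("length field does not match blob size")
    body = blob[4:4 + length]
    (stored_crc,) = struct.unpack_from(">I", blob, 4 + length)
    if zlib.crc32(body) != stored_crc:
        raise ValueError("CRC-32 mismatch")
    if body[:4] != DV_MAGIC:
        raise ValueError("unexpected magic bytes")
    return body[4:]
```

A writer in this sketch would call merge_positions on the existing and new
deletion vectors, encode the result, and pass it to frame_blob; a reader
calls read_blob and rejects any blob whose checksum does not match.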
Testing:
- As this patch is the first part of a multi-part implementation,
the functionality is validated only in part 2. For reader/writer
sanity checks, manual validation was performed with Spark 3.5.5 and
Iceberg 1.10.1.
Change-Id: I068a071f9db907064ccec8568db5234863eb4587
Reviewed-on: http://gerrit.cloudera.org:8080/24071
Reviewed-by: Zoltan Borok-Nagy <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Add support for Puffin File writer
> ----------------------------------
>
> Key: IMPALA-14755
> URL: https://issues.apache.org/jira/browse/IMPALA-14755
> Project: IMPALA
> Issue Type: New Feature
> Reporter: Peter Rozsa
> Assignee: Peter Rozsa
> Priority: Major
>
> Puffin is a lightweight container format for arbitrary binary blobs, and its
> implementation is a prerequisite for supporting the Iceberg V3 spec.
> Specifically, it enables the storage of Deletion Vectors as sidecar files and
> provides a standardized container for Column Statistics (e.g., Apache
> DataSketches Theta blobs).
> Design:
> https://docs.google.com/document/d/1EOfEdo5iAx4QDcIGaHXKwI5Z3lojNcm9ryfJSQW3grc/edit?tab=t.0#heading=h.b5qmi41d0vf2