laskoviymishka opened a new issue, #1135:
URL: https://github.com/apache/iceberg-go/issues/1135

   Today iceberg-go writes deletion vectors (DVs) for v3 unpartitioned tables 
(#1113), but v3 partitioned tables still fall through to the legacy Parquet 
position-delete writer with a deprecation warning. The spec mandates DVs on v3 
regardless of partitioning, and Java's `BaseDVFileWriter` has no such split.
   
   Real-world example: time-series tables partitioned by month or day, where 
corrections and backfills routinely span partition boundaries. A single delete 
commit today emits one Parquet position-delete file per affected partition 
rather than a single Puffin file with one DV blob per affected data file. This 
has three costs: storage amplifies (Parquet rows of `(file_path, pos)` tuples 
are much larger than Roaring-compressed bitmaps), reads amplify (one Parquet 
file open per partition vs. O(1) blob seeks per data file), and the deprecation 
warning fires on every delete commit until this lands.
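
   To make the file-count difference concrete, here is a minimal Go sketch 
that counts the outputs of each strategy for one delete commit. The 
`PositionDelete` type, the partition strings, and both helper functions are 
hypothetical and exist only for this illustration; none of them are iceberg-go 
APIs, and the sketch models file counts only, not bytes on disk.

```go
package main

import "fmt"

// PositionDelete is a hypothetical record for illustration: one deleted row
// position in one data file, tagged with the partition that file belongs to.
type PositionDelete struct {
	Partition string // illustrative, e.g. "month=2024-01"
	DataFile  string
	Pos       int64
}

// legacyFileCount models the current partitioned path described above:
// one Parquet position-delete file per affected partition.
func legacyFileCount(deletes []PositionDelete) int {
	partitions := map[string]struct{}{}
	for _, d := range deletes {
		partitions[d.Partition] = struct{}{}
	}
	return len(partitions)
}

// dvLayout models the DV path described above: a single Puffin file per
// flush that holds one blob per affected data file.
func dvLayout(deletes []PositionDelete) (puffinFiles, blobs int) {
	if len(deletes) == 0 {
		return 0, 0
	}
	files := map[string]struct{}{}
	for _, d := range deletes {
		files[d.DataFile] = struct{}{}
	}
	return 1, len(files)
}

func main() {
	// A backfill correction that touches three monthly partitions.
	deletes := []PositionDelete{
		{"month=2024-01", "a.parquet", 3},
		{"month=2024-01", "a.parquet", 9},
		{"month=2024-02", "b.parquet", 0},
		{"month=2024-03", "c.parquet", 7},
		{"month=2024-03", "d.parquet", 1},
	}
	fmt.Println("legacy position-delete files:", legacyFileCount(deletes))
	p, b := dvLayout(deletes)
	fmt.Println("puffin files:", p, "DV blobs:", b)
}
```

   The legacy count grows with the number of partitions a commit touches, 
while the DV path stays at one Puffin file and grows only in blob count.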
   
   Java's `BaseDVFileWriter` is partition-agnostic at the Puffin level: one 
Puffin file globally per flush, one blob per data file, partition data 
propagated to each output DV manifest entry via 
`withPartition(deletes.partition())`. The "partitioned vs unpartitioned DV" 
split iceberg-go has today is an implementation choice rooted in `DVWriter.Add` 
not yet accepting partition metadata, not a spec distinction.
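
   A rough sketch of what a partition-aware writer could look like, under 
stated assumptions: `sketchDVWriter`, `dvEntry`, and their methods are 
hypothetical names invented here, not iceberg-go's `DVWriter` or its actual 
signature. The partition is modeled as a plain string, and a sorted position 
slice stands in for the Roaring bitmap a real DV stores. The point is only 
that partition data rides along per data file while the Puffin file stays 
global.

```go
package main

import (
	"fmt"
	"sort"
)

// dvEntry is a hypothetical stand-in for one DV manifest entry: the data file
// it references, that file's partition, and the deleted positions.
type dvEntry struct {
	DataFile  string
	Partition string  // illustrative; real partition data is a typed tuple
	Positions []int64 // sorted and de-duplicated; a real DV uses a Roaring bitmap
}

// sketchDVWriter is NOT iceberg-go's DVWriter. It only illustrates buffering
// deletes per data file while remembering each file's partition.
type sketchDVWriter struct {
	partitions map[string]string
	positions  map[string]map[int64]struct{}
}

func newSketchDVWriter() *sketchDVWriter {
	return &sketchDVWriter{
		partitions: map[string]string{},
		positions:  map[string]map[int64]struct{}{},
	}
}

// Add records a deleted position together with the partition of its data file.
// A data file belongs to exactly one partition, so a mismatch is an error.
func (w *sketchDVWriter) Add(dataFile, partition string, pos int64) error {
	if prev, ok := w.partitions[dataFile]; ok && prev != partition {
		return fmt.Errorf("data file %q seen with partitions %q and %q",
			dataFile, prev, partition)
	}
	w.partitions[dataFile] = partition
	set, ok := w.positions[dataFile]
	if !ok {
		set = map[int64]struct{}{}
		w.positions[dataFile] = set
	}
	set[pos] = struct{}{}
	return nil
}

// Flush returns the entries that would share one Puffin file: one entry per
// data file, ordered by path, each carrying its own partition.
func (w *sketchDVWriter) Flush() []dvEntry {
	entries := make([]dvEntry, 0, len(w.positions))
	for file, set := range w.positions {
		pos := make([]int64, 0, len(set))
		for p := range set {
			pos = append(pos, p)
		}
		sort.Slice(pos, func(i, j int) bool { return pos[i] < pos[j] })
		entries = append(entries, dvEntry{file, w.partitions[file], pos})
	}
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].DataFile < entries[j].DataFile
	})
	return entries
}

func main() {
	w := newSketchDVWriter()
	deletes := []struct {
		file, partition string
		pos             int64
	}{
		{"jan/a.parquet", "month=2024-01", 9},
		{"jan/a.parquet", "month=2024-01", 3},
		{"feb/b.parquet", "month=2024-02", 0},
	}
	for _, d := range deletes {
		if err := w.Add(d.file, d.partition, d.pos); err != nil {
			panic(err)
		}
	}
	for _, e := range w.Flush() {
		fmt.Printf("%s partition=%s positions=%v\n", e.DataFile, e.Partition, e.Positions)
	}
}
```

   Keying the buffer by data file rather than by partition is what keeps the 
writer partition-agnostic: the partition is metadata attached to each entry, 
not a dimension that splits the output.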
   
   Spec: https://iceberg.apache.org/spec/#deletion-vectors
   Java reference: `BaseDVFileWriter` in apache/iceberg
   
   Parent: #589.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

