tanmayrauth commented on issue #1077:
URL: https://github.com/apache/iceberg-go/issues/1077#issuecomment-4444141762
A couple of things to clear up:
Your scan is not causing any rewrites. Scanning is purely read-only: when
position-delete files exist, the scanner just loads them into memory and skips
those rows while reading. Nothing gets written back to disk.
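To make the read-only point concrete, here is a minimal sketch of how a reader can apply position deletes in memory. It is not iceberg-go's actual scanner code; `applyPositionDeletes` and the string-slice row representation are hypothetical names picked for illustration.

```go
package main

import "fmt"

// applyPositionDeletes returns the rows whose ordinal position in the data
// file is NOT listed in deleted. It only reads its inputs; nothing is written
// back anywhere. (Hypothetical helper, not part of iceberg-go.)
func applyPositionDeletes(rows []string, deleted []int64) []string {
	// Build a lookup set from the positions recorded in the delete file.
	skip := make(map[int64]struct{}, len(deleted))
	for _, pos := range deleted {
		skip[pos] = struct{}{}
	}

	kept := make([]string, 0, len(rows))
	for i, row := range rows {
		if _, gone := skip[int64(i)]; !gone {
			kept = append(kept, row)
		}
	}
	return kept
}

func main() {
	rows := []string{"a", "b", "c", "d"}
	// Positions 1 and 3 were deleted, so only "a" and "c" survive the scan.
	fmt.Println(applyPositionDeletes(rows, []int64{1, 3}))
}
```

The input slice is left untouched, which mirrors the fact that the data file on disk is unchanged after a scan.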
The reason merge-on-read helps is that it changes what happens at delete
time, not scan time. With the default copy-on-write, calling tbl.Delete(...)
actually reads and rewrites affected data files without the deleted rows. With
merge-on-read, it just writes a small position-delete file saying "skip row 42,
row 108, etc." and leaves your data files completely alone. That's why it
minimizes rewrites - the original Parquet files are never touched.
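The difference in delete-time work can be sketched with a toy model. Everything here (`dataFile`, `deleteCopyOnWrite`, `deleteMergeOnRead`) is a hypothetical stand-in and not the iceberg-go write path; it only illustrates which side produces a new data file and which produces a list of positions.

```go
package main

import "fmt"

// dataFile is a toy stand-in for an immutable Parquet data file.
type dataFile struct {
	path string
	rows []int
}

// deleteCopyOnWrite returns a NEW data file containing only surviving rows.
// Every surviving row is copied, so the cost scales with the file size.
func deleteCopyOnWrite(f dataFile, match func(int) bool) dataFile {
	out := dataFile{path: f.path + ".rewritten"}
	for _, r := range f.rows {
		if !match(r) {
			out.rows = append(out.rows, r)
		}
	}
	return out
}

// deleteMergeOnRead leaves f untouched and returns only the positions of the
// matching rows, which is what a position-delete file records.
func deleteMergeOnRead(f dataFile, match func(int) bool) []int64 {
	var positions []int64
	for i, r := range f.rows {
		if match(r) {
			positions = append(positions, int64(i))
		}
	}
	return positions
}

func main() {
	f := dataFile{path: "data-0.parquet", rows: []int{10, 20, 30, 40}}
	isBig := func(v int) bool { return v >= 30 }

	fmt.Println(deleteCopyOnWrite(f, isBig).rows) // rewritten file: [10 20]
	fmt.Println(deleteMergeOnRead(f, isBig))      // delete positions: [2 3]
	fmt.Println(len(f.rows))                      // original still has 4 rows
}
```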
The trade-off is that over time you accumulate delete files, which adds a
small overhead to reads (the scanner has to apply them). Eventually you'd want
to compact, but that's a batch operation you run on your own schedule.
V3 Deletion Vectors are essentially a more efficient encoding of the same
idea (bitmaps in Puffin files instead of separate Parquet delete files).
Faster to read, smaller on disk, but same fundamental property: no data file
rewrites on delete. They're not implemented in iceberg-go yet, though: the
plumbing exists in the puffin reader/writer, but the scanner and write path
aren't wired up.
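As a rough illustration of the bitmap idea, the sketch below marks deleted row positions as bits. It is an assumption-laden toy: real deletion vectors use a compressed Roaring bitmap serialized into a Puffin blob, whereas the `bitset` type here is a plain uncompressed bitmap invented for this example.

```go
package main

import "fmt"

// bitset is a plain, uncompressed bitmap of deleted row positions.
// (Illustrative only; not the Roaring/Puffin encoding used by Iceberg V3.)
type bitset []uint64

// set marks row position pos as deleted, growing the bitmap if needed.
func (b *bitset) set(pos uint64) {
	word := pos / 64
	for uint64(len(*b)) <= word {
		*b = append(*b, 0)
	}
	(*b)[word] |= 1 << (pos % 64)
}

// has reports whether row position pos is marked as deleted.
func (b bitset) has(pos uint64) bool {
	word := pos / 64
	return word < uint64(len(b)) && b[word]&(1<<(pos%64)) != 0
}

func main() {
	var dv bitset
	dv.set(42)
	dv.set(108)

	fmt.Println(dv.has(42), dv.has(43), dv.has(108)) // true false true
	fmt.Println(len(dv))                             // 2 words cover positions 0..127
}
```

A membership check is a single word lookup and mask, which is the intuition behind "faster to read" compared with decoding a Parquet file of positions.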
So the actionable answer: set "write.delete.mode": "merge-on-read" on your
table and you're good. Your deletes won't rewrite any data files.
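A minimal sketch of the property itself, using a plain `map[string]string` rather than any specific iceberg-go call. The helper `withMergeOnReadDeletes` is hypothetical; how the resulting properties are applied (at table creation or through a property-update commit) depends on your catalog and client version, so check the API you are using.

```go
package main

import "fmt"

const (
	deleteModeKey   = "write.delete.mode"
	mergeOnReadMode = "merge-on-read"
)

// withMergeOnReadDeletes returns a copy of props with the delete mode set to
// merge-on-read. The input map is not modified. (Hypothetical helper.)
func withMergeOnReadDeletes(props map[string]string) map[string]string {
	out := make(map[string]string, len(props)+1)
	for k, v := range props {
		out[k] = v
	}
	out[deleteModeKey] = mergeOnReadMode
	return out
}

func main() {
	existing := map[string]string{"write.format.default": "parquet"}
	props := withMergeOnReadDeletes(existing)
	fmt.Println(props[deleteModeKey]) // merge-on-read
}
```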
cc @zeroshade @laskoviymishka - anything I'm missing here? Specifically
around timeline for DV support and whether we'd recommend equality deletes over
position deletes for simple predicate-based deletes like this (equality deletes
skip reading the data file entirely at delete time, but cost more at scan time).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]