laskoviymishka commented on issue #1077: URL: https://github.com/apache/iceberg-go/issues/1077#issuecomment-4444679667
Adding a bit of context on top of what @tanmayrauth said.

For a one-shot predicate delete like yours, position deletes are the right default. That’s what `tbl.Delete(predicate, nil)` writes when `write.delete.mode=merge-on-read`.

Equality deletes are a different tool. They’re useful when you already have the keys and don’t want to read data files just to discover row positions; the typical case is streaming CDC, where upstream gives you a primary key. In your case, you already need to scan the data files to evaluate the predicate, so you get the positions for free. No real reason to reach for equality deletes.

The thing I’d watch is not really the delete mode but what happens when delete files accumulate without compaction. We hit this on the CDC sink at [https://github.com/transferia/iceberg](https://github.com/transferia/iceberg), PostgreSQL → Iceberg v2 with equality-deletes-on-write:

* clean 10K-row table: 41 ms scan
* after ~5 delete files: 170 ms, around 4x slower
* after ~50 delete files: 5,087 ms, around 124x slower

Equality deletes are especially painful because they don’t have file-level pruning: every equality-delete file has to be checked against every data file. Position deletes are friendlier because they reference specific files, but the cost still grows as delete files pile up. So either way, the way to keep reads bounded is periodic compaction.

Good news is that iceberg-go has grown native compaction recently, largely thanks to @tanmayrauth’s latest work here. `Transaction.RewriteDataFiles` (#892) does the bin-pack rewrite with deletes applied and dangling position-deletes cleaned up in the same commit. The CLI also has `iceberg compact analyze` / `iceberg compact run` (#903), so you can run this directly from Go or as a scheduled job without leaving the ecosystem.
@tanmayrauth has also been filling in the surrounding maintenance commands over the last couple of weeks (`expire-snapshots`, `clean-orphan-files`, `partition-stats`, `info`), so the housekeeping story is much more complete than it was a month ago.

On Deletion Vectors: reader and scanner integration are in via #866 and #996, so DV-based deletes can already be read. The write path is not wired yet; no firm timeline, but it’s the natural next step now that read works. For your case, it would not change the core story much: DVs are a more efficient way to encode “skip these positions”, but they still need compaction over time.

TL;DR: stick with `write.delete.mode=merge-on-read`, let `tbl.Delete` produce position-delete files, and run `iceberg compact run` (or call `RewriteDataFiles` from code) on a schedule if you’ll be doing this regularly.
