laskoviymishka commented on issue #1077: URL: https://github.com/apache/iceberg-go/issues/1077#issuecomment-4445159884
TL;DR: Yes and yes, *with some nuances*.

Q1: yes, with one small clarification.

`tbl.Delete` is the high-level delete API. With `write.delete.mode=merge-on-read`, it writes **position-delete files**. With copy-on-write, which is the default, it rewrites data files.

Equality deletes do **not** come from `Delete`. Those require calling the separate lower-level `WriteEqualityDeletes` API explicitly, usually for cases like CDC where you already have keys.

So your framing is right for `Delete`: the call site doesn’t expose “position vs equality”; in MoR it just writes position deletes. But equality deletes are a separate path, not another mode of `Delete`.

Q2: yes, exactly.

Position-delete files are just row-level “`file_path + position` was deleted” entries. They don’t help the planner skip files for `subject = 'foo'`.

A scan like `subject = 'foo'` still needs to:

1. read every data file that wasn’t pruned,
2. evaluate `subject = 'foo'`,
3. then remove rows that are present in position-delete files.

Predicate-level skipping comes from separate stats-based mechanisms:

* manifest column bounds / file pruning
* Parquet row-group statistics

Those don’t know that you “deleted all foo rows.” They only look at min/max stats. So if `foo` is spread across files whose `subject` bounds still include `foo`, those files still get touched. The delete reduces the final row count, but not necessarily the planner’s file or row-group set.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
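
To make the Q2 point in the comment above concrete, here is a toy Go sketch of the three scan steps. It deliberately does not use the iceberg-go API: `dataFile`, `posDelete`, `pruneByBounds`, `scan`, and `sampleFiles` are hypothetical names invented for this illustration, and the in-memory "files" are made up. The only thing it is meant to show is that min/max pruning never consults position deletes, so deleting every `foo` row changes the row count but not the set of files read.

```go
package main

import "fmt"

// dataFile is a toy stand-in for a data file tracked by a manifest.
// The bounds play the role of manifest column bounds for `subject`.
type dataFile struct {
	path       string
	minSubject string
	maxSubject string
	subjects   []string // row values; the slice index is the row position
}

// posDelete mirrors a position-delete entry: file_path + position.
type posDelete struct {
	path string
	pos  int
}

// pruneByBounds keeps files whose [min, max] bounds could contain value.
// Note that it takes no deletes at all: pruning only looks at stats.
func pruneByBounds(files []dataFile, value string) []dataFile {
	var kept []dataFile
	for _, f := range files {
		if f.minSubject <= value && value <= f.maxSubject {
			kept = append(kept, f)
		}
	}
	return kept
}

// scan evaluates `subject = value` and returns how many files were read
// and how many rows survive after applying position deletes.
func scan(files []dataFile, deletes []posDelete, value string) (filesRead, rows int) {
	deleted := make(map[posDelete]bool, len(deletes))
	for _, d := range deletes {
		deleted[d] = true
	}
	planned := pruneByBounds(files, value) // step 1: read unpruned files
	for _, f := range planned {
		for pos, s := range f.subjects {
			if s != value { // step 2: evaluate the predicate
				continue
			}
			if deleted[posDelete{path: f.path, pos: pos}] { // step 3: drop deleted rows
				continue
			}
			rows++
		}
	}
	return len(planned), rows
}

// sampleFiles returns made-up data: `foo` is spread across two files.
func sampleFiles() []dataFile {
	return []dataFile{
		{path: "a.parquet", minSubject: "bar", maxSubject: "zap", subjects: []string{"bar", "foo", "zap"}},
		{path: "b.parquet", minSubject: "apple", maxSubject: "foo", subjects: []string{"apple", "foo"}},
		{path: "c.parquet", minSubject: "x", maxSubject: "y", subjects: []string{"x", "y"}},
	}
}

func main() {
	files := sampleFiles()

	n, rows := scan(files, nil, "foo")
	fmt.Printf("before delete: files read=%d rows=%d\n", n, rows)

	// Position deletes covering every `foo` row.
	deletes := []posDelete{{path: "a.parquet", pos: 1}, {path: "b.parquet", pos: 1}}
	n, rows = scan(files, deletes, "foo")
	fmt.Printf("after delete:  files read=%d rows=%d\n", n, rows)
}
```

In this toy data set, `c.parquet` is pruned in both runs because its bounds exclude `foo`, while `a.parquet` and `b.parquet` are still read after the delete even though no matching row survives. A real planner works on manifest entries and Parquet row groups rather than in-memory slices, so treat this only as a model of the reasoning.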
