sdd opened a new pull request, #652: URL: https://github.com/apache/iceberg-rust/pull/652
This Draft PR outlines an approach to add support for proper handling of delete files within table scans.

The approach taken is to include a list of delete file paths in every `FileScanTask`. At the moment it is assumed that this list refers to delete files that **may** apply to the data file in the scan, rather than having been filtered to only contain delete files that definitely do apply to this data file. Further optimisation of `plan_files` is expected in the future to ensure that this list is pre-filtered before being included in the `FileScanTask`.

The arrow reader now has the responsibility of retrieving the applicable delete files from FileIO as well as the data file. Thanks to the Object Cache already being in place, if file scan tasks being handled in parallel in the same process attempt to load the same delete file, the Object Cache should ensure that only a single load and parse occurs.

The approach taken for handling each type of delete file is as follows:

## Positional Deletes

Positional Delete support is implemented using the `RowSelection` functionality of the parquet arrow reader that we are already using. The list of applicable positional delete indices is turned into a `RowSelection`. If the scan already has `enable_row_selection` enabled and there is a scan filter predicate, then the `RowSelection` derived from that predicate is intersected with the positional delete `RowSelection` to yield a single combined `RowSelection`.

## Equality Deletes

All rows from all applicable equality delete files are combined together to create a single `BoundPredicate`. If the scan also has a filter predicate, this is ANDed with the delete predicate to form a final combined `BoundPredicate` that is used as before to construct the arrow `RowFilter` and is also used in the row group filtering.

## TODO

There are deliberately quite a lot of TODOs in here. This PR exists to get buy-in on the approach before all the implementation details are added and tests written.
The code in this branch still builds though and all the tests (should! 😅 ) pass.
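
To make the Positional Deletes step concrete, here is a minimal, std-only sketch of turning a list of deleted row positions into run-length select/skip runs. The `Selector` enum and `selection_from_deletes` function are hypothetical stand-ins invented for illustration; they are not the parquet crate's types and not the code in this PR, which would build a real `RowSelection` instead.

```rust
/// Hypothetical stand-in for a run-length row selector:
/// either read the next N rows or skip the next N rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Selector {
    Select(usize),
    Skip(usize),
}

/// Build select/skip runs covering a data file of `num_rows` rows,
/// given positional delete indices (assumed 0-based here).
/// The input does not need to be sorted or de-duplicated.
fn selection_from_deletes(deleted: &[u64], num_rows: u64) -> Vec<Selector> {
    // Normalise: drop out-of-range positions, then sort and de-duplicate.
    let mut positions: Vec<u64> = deleted
        .iter()
        .copied()
        .filter(|&p| p < num_rows)
        .collect();
    positions.sort_unstable();
    positions.dedup();

    let mut out = Vec::new();
    let mut cursor = 0u64; // first row not yet covered by a selector
    let mut i = 0;
    while i < positions.len() {
        let start = positions[i];
        // Extend the run while the deleted positions are consecutive.
        let mut end = start + 1;
        i += 1;
        while i < positions.len() && positions[i] == end {
            end += 1;
            i += 1;
        }
        // Rows between the previous run and this one are kept.
        if start > cursor {
            out.push(Selector::Select((start - cursor) as usize));
        }
        out.push(Selector::Skip((end - start) as usize));
        cursor = end;
    }
    // Keep whatever remains after the last deleted run.
    if cursor < num_rows {
        out.push(Selector::Select((num_rows - cursor) as usize));
    }
    out
}
```

For example, deletes at positions 5, 1, 2 and 9 in a 10-row file come out as select 1, skip 2, select 2, skip 1, select 3, skip 1. Combining this with a selection derived from the scan filter is then an intersection of two such run lists, so a row is read only if both selections keep it.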
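
The Equality Deletes step can be sketched in the same spirit. A row is deleted if it matches every column of any one delete row, so the "keep" predicate is the AND, over all delete rows, of "at least one column differs". The `Pred` enum, `keep_predicate` and `eval` below are hypothetical, simplified to integer columns, and ignore null handling entirely; the PR itself works with iceberg-rust's `BoundPredicate`, which this toy type only imitates.

```rust
use std::collections::HashMap;

/// Toy row: column name -> integer value (illustrative assumption).
type Row = HashMap<&'static str, i64>;

/// Hypothetical miniature predicate type, standing in for a bound predicate.
#[derive(Debug, Clone)]
enum Pred {
    AlwaysTrue,
    NotEq(&'static str, i64),
    Gt(&'static str, i64),
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

/// Combine all equality delete rows into one predicate that is true
/// for rows that should be kept.
fn keep_predicate(delete_rows: &[Vec<(&'static str, i64)>]) -> Pred {
    delete_rows
        .iter()
        // One delete row becomes: col1 != v1 OR col2 != v2 OR ...
        .filter_map(|row| {
            row.iter()
                .map(|&(col, val)| Pred::NotEq(col, val))
                .reduce(|a, b| Pred::Or(Box::new(a), Box::new(b)))
        })
        // A kept row must survive every delete row, hence AND.
        .fold(Pred::AlwaysTrue, |acc, p| {
            Pred::And(Box::new(acc), Box::new(p))
        })
}

/// Evaluate a predicate against a single row.
fn eval(p: &Pred, row: &Row) -> bool {
    match p {
        Pred::AlwaysTrue => true,
        Pred::NotEq(c, v) => row.get(c) != Some(v),
        Pred::Gt(c, v) => row.get(c).map_or(false, |x| x > v),
        Pred::And(a, b) => eval(a, row) && eval(b, row),
        Pred::Or(a, b) => eval(a, row) || eval(b, row),
    }
}
```

ANDing the result with the scan's own filter, for instance `Pred::And(Box::new(Pred::Gt("id", 0)), Box::new(keep_predicate(&rows)))`, gives the single combined predicate described above. One consequence of this shape is that the predicate grows linearly with the number of delete rows.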
