[PR] WIP: Table Scan Delete File Handling [iceberg-rust]

via GitHub Fri, 27 Sep 2024 00:45:35 -0700


sdd opened a new pull request, #652:
URL: https://github.com/apache/iceberg-rust/pull/652


   This Draft PR outlines an approach to add support for proper handling of 
delete files within table scans.
   
   The approach taken is to include a list of delete file paths in every 
`FileScanTask`. At the moment it is assumed that this list refers to delete 
files that **may** apply to the data file in the scan, rather than having been 
filtered to only contain delete files that definitely do apply to this data 
file. Further optimisation of `plan_files` is expected in the future to ensure 
that this list is pre-filtered before being included in the `FileScanTask`.
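   
   As a rough illustration of the shape described above, the sketch below 
attaches every known delete file to every data file, with no pre-filtering. 
The struct `FileScanTaskSketch`, its field names, and the helper `plan_tasks` 
are hypothetical and are not taken from the iceberg-rust source.

```rust
/// Hypothetical, simplified stand-in for a scan task. Field names are
/// illustrative only and do not mirror the real `FileScanTask`.
#[derive(Debug, Clone)]
pub struct FileScanTaskSketch {
    /// Path of the data file to read.
    pub data_file_path: String,
    /// Paths of delete files that *may* apply to this data file.
    pub delete_file_paths: Vec<String>,
}

/// Attach every known delete file to every data file (no pre-filtering yet).
pub fn plan_tasks(data_files: &[&str], delete_files: &[&str]) -> Vec<FileScanTaskSketch> {
    data_files
        .iter()
        .map(|data| FileScanTaskSketch {
            data_file_path: data.to_string(),
            delete_file_paths: delete_files.iter().map(|d| d.to_string()).collect(),
        })
        .collect()
}
```

   A later pre-filtering step would shrink `delete_file_paths` per task 
without changing what the reader has to do with the list.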
   
   The arrow reader now has the responsibility of retrieving the applicable 
delete files from FileIO as well as the data file. Thanks to the Object Cache 
already being in place, if several file scan tasks handled in parallel in the 
same process attempt to load the same delete file, the Object Cache should 
ensure that only a single load and parse occurs.
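   
   The "single load and parse" behaviour can be sketched with the standard 
library alone. This is not the project's Object Cache; `LoadOnceCache` and 
`get_or_load` are hypothetical names used only to show the idea of one load 
per path under concurrent access.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical load-once cache keyed by file path.
pub struct LoadOnceCache<T> {
    slots: Mutex<HashMap<String, Arc<OnceLock<Arc<T>>>>>,
}

impl<T> LoadOnceCache<T> {
    pub fn new() -> Self {
        Self { slots: Mutex::new(HashMap::new()) }
    }

    /// Return the cached value for `path`, running `load` at most once per
    /// path even when several threads ask for the same path concurrently.
    pub fn get_or_load(&self, path: &str, load: impl FnOnce() -> T) -> Arc<T> {
        // Hold the map lock only long enough to find or create the slot.
        let slot = {
            let mut slots = self.slots.lock().unwrap();
            Arc::clone(
                slots
                    .entry(path.to_string())
                    .or_insert_with(|| Arc::new(OnceLock::new())),
            )
        };
        // `OnceLock` makes concurrent callers wait for the first initializer.
        Arc::clone(slot.get_or_init(|| Arc::new(load())))
    }
}
```

   The per-path `OnceLock` means tasks reading different delete files do not 
wait on one another; only tasks that want the same file do.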
   
   The approach taken for handling each type of delete file is as follows:
   
   ## Positional Deletes
   
   Positional Delete support is implemented using the `RowSelection` 
functionality of the parquet arrow reader that we are already using. The list 
of applicable positional delete indices is turned into a `RowSelection`. If the 
scan already has `enable_row_selection` enabled and there is a scan filter 
predicate, then the `RowSelection` derived from that predicate is intersected 
with the positional delete `RowSelection` to yield a single combined 
`RowSelection`.
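   
   The conversion from delete positions to a selection can be sketched as 
follows. To stay self-contained, this uses a local `Run` enum rather than the 
parquet crate's own selection types, and it intersects two selections through 
plain boolean masks. All names here are hypothetical, and the input positions 
are assumed to be sorted and de-duplicated.

```rust
/// Simplified stand-in for one run of a row selection: keep or skip `n`
/// consecutive rows. This local enum only mirrors the idea.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Run {
    Select(usize),
    Skip(usize),
}

/// Turn sorted, de-duplicated delete positions into alternating runs that
/// cover rows `0..total_rows`. Positions past the end are ignored.
pub fn deletes_to_runs(total_rows: usize, deleted: &[usize]) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut next_row = 0; // first row not yet covered by a run
    let mut iter = deleted
        .iter()
        .copied()
        .filter(|&p| p < total_rows)
        .peekable();
    while let Some(start) = iter.next() {
        if start > next_row {
            runs.push(Run::Select(start - next_row));
        }
        // Merge adjacent delete positions into one skip run.
        let mut end = start + 1;
        while iter.peek() == Some(&end) {
            iter.next();
            end += 1;
        }
        runs.push(Run::Skip(end - start));
        next_row = end;
    }
    if next_row < total_rows {
        runs.push(Run::Select(total_rows - next_row));
    }
    runs
}

/// Expand runs into one boolean per row (`true` means the row is kept).
pub fn to_mask(runs: &[Run]) -> Vec<bool> {
    runs.iter()
        .flat_map(|r| match *r {
            Run::Select(n) => std::iter::repeat(true).take(n),
            Run::Skip(n) => std::iter::repeat(false).take(n),
        })
        .collect()
}

/// A row survives only if both selections keep it.
pub fn intersect_masks(a: &[bool], b: &[bool]) -> Vec<bool> {
    a.iter().zip(b).map(|(x, y)| *x && *y).collect()
}
```

   For example, deleting positions 1, 4 and 5 from an 8-row file gives the 
runs select 1, skip 1, select 2, skip 2, select 2. Intersecting that with a 
filter selection that keeps only the first four rows leaves rows 0, 2 and 3.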
   
   ## Equality Deletes
   
   All rows from all applicable equality delete files are combined together to 
create a single `BoundPredicate`. If the scan also has a filter predicate, this 
is ANDed with the delete predicate to form a final combined `BoundPredicate` 
that is used as before to construct the arrow `RowFilter` and is also used in 
the row group filtering.
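   
   A minimal sketch of how delete rows could be folded into one predicate and 
ANDed with a scan filter is shown below. The `Pred` enum is a hypothetical 
stand-in for `BoundPredicate`, limited to integer columns. It also assumes the 
delete predicate is expressed as "the row matches none of the delete rows", 
which is an interpretation and is not stated in the description above.

```rust
use std::collections::HashMap;

/// A data row, reduced to named integer columns for illustration.
pub type Row = HashMap<String, i64>;

/// Hypothetical, much-simplified stand-in for a bound predicate.
#[derive(Debug, Clone)]
pub enum Pred {
    AlwaysTrue,
    Eq(String, i64),
    Gt(String, i64),
    And(Box<Pred>, Box<Pred>),
    Not(Box<Pred>),
}

impl Pred {
    /// Evaluate the predicate against one row; missing columns never match.
    pub fn eval(&self, row: &Row) -> bool {
        match self {
            Pred::AlwaysTrue => true,
            Pred::Eq(col, v) => row.get(col) == Some(v),
            Pred::Gt(col, v) => row.get(col).map_or(false, |x| x > v),
            Pred::And(a, b) => a.eval(row) && b.eval(row),
            Pred::Not(p) => !p.eval(row),
        }
    }
}

/// Predicate that is true when a data row does NOT match this delete row
/// on every listed (column, value) pair.
fn not_matching(pairs: &[(&str, i64)]) -> Pred {
    let matches = pairs.iter().fold(Pred::AlwaysTrue, |acc, (col, v)| {
        Pred::And(Box::new(acc), Box::new(Pred::Eq(col.to_string(), *v)))
    });
    Pred::Not(Box::new(matches))
}

/// Combine all delete rows: a data row is kept only if it matches none of them.
pub fn delete_predicate(delete_rows: &[Vec<(&str, i64)>]) -> Pred {
    delete_rows
        .iter()
        .map(|pairs| not_matching(pairs))
        .fold(Pred::AlwaysTrue, |acc, p| Pred::And(Box::new(acc), Box::new(p)))
}
```

   With a scan filter of `id > 1` and equality deletes for `id = 2` and 
`id = 3`, the combined predicate `And(filter, delete_predicate(..))` rejects 
ids 1, 2 and 3 and accepts id 4.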
   
   ## TODO
   
   There are deliberately quite a lot of TODOs in here. This PR exists to get 
buy-in on the approach before all the implementation details are added and 
tests written. The code in this branch still builds, though, and all the tests 
(should! 😅) pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

