dojiong opened a new pull request, #1941:
URL: https://github.com/apache/iceberg-rust/pull/1941

   ## Which issue does this PR close?
   
   - Closes #.
   
   ## What changes are included in this PR?
   
     Currently, ArrowReader instantiates a new CachingDeleteFileLoader (and consequently a new DeleteFilter) for each FileScanTask when calling load_deletes. This results in the DeleteFilter state being isolated per task. If multiple tasks reference the same delete file (common with positional deletes), that delete file is re-read and re-parsed for every task, leading to significant performance overhead and redundant I/O.
   
     Changes
   
      * Shared State: Moved the DeleteFilter instance into the CachingDeleteFileLoader struct. Since ArrowReader holds a single CachingDeleteFileLoader instance across its lifetime, the DeleteFilter state is now shared across all file scan tasks processed by that reader.
      * Positional Delete Caching: Implemented a state machine for loading positional delete files (PosDelState) in DeleteFilter.
          * Added try_start_pos_del_load: Coordinates concurrent access to the same positional delete file.
          * Added finish_pos_del_load: Signals completion of loading.
          * Synchronization: Introduced a WaitFor state. Unlike equality deletes (which are accessed asynchronously), positional deletes are accessed synchronously by ArrowReader. Therefore, if a task encounters a file that is currently being loaded by another task, it must asynchronously wait (notify.notified().await) during the loading phase to ensure the data is fully populated before ArrowReader proceeds.
      * Refactoring: Updated load_file_for_task and related types in CachingDeleteFileLoader to support the new caching logic and carry file paths through the loading context.
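
The coordination described above can be illustrated with a minimal, dependency-free sketch. It is not the code in this PR: the PR waits asynchronously via notify.notified().await, whereas this sketch swaps in a blocking std::sync::Condvar so it compiles with the standard library alone. The type PosDelAction, the DeleteVector alias, the field names, and the exact signatures of try_start_pos_del_load and finish_pos_del_load are assumptions made for illustration. Failure handling (a loader that errors out before finishing) is deliberately left out.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical stand-in for parsed positional deletes (deleted row positions).
pub type DeleteVector = Arc<Vec<u64>>;

/// Per-file loading state. Illustrative only, not the PR's exact PosDelState.
enum PosDelState {
    Loading,
    Loaded(DeleteVector),
}

/// What the caller should do after asking to load a delete file.
pub enum PosDelAction {
    /// The caller is first: it must load the file, then call `finish_pos_del_load`.
    Load,
    /// The file is already loaded (possibly after waiting): shared result.
    Ready(DeleteVector),
}

/// Shared across all tasks, so each delete file is parsed at most once.
#[derive(Default)]
pub struct DeleteFilter {
    states: Mutex<HashMap<String, PosDelState>>,
    loaded: Condvar,
}

impl DeleteFilter {
    /// Claims the load for `path`, or waits until another caller finishes it.
    pub fn try_start_pos_del_load(&self, path: &str) -> PosDelAction {
        let mut states = self.states.lock().unwrap();
        loop {
            match states.get(path) {
                Some(PosDelState::Loaded(v)) => return PosDelAction::Ready(Arc::clone(v)),
                Some(PosDelState::Loading) => {}
                None => break,
            }
            // Another caller is loading this file: block until it signals.
            states = self.loaded.wait(states).unwrap();
        }
        states.insert(path.to_string(), PosDelState::Loading);
        PosDelAction::Load
    }

    /// Publishes the parsed deletes for `path` and wakes all waiters.
    pub fn finish_pos_del_load(&self, path: &str, deletes: Vec<u64>) -> DeleteVector {
        let shared = Arc::new(deletes);
        self.states
            .lock()
            .unwrap()
            .insert(path.to_string(), PosDelState::Loaded(Arc::clone(&shared)));
        self.loaded.notify_all();
        shared
    }
}
```

In this sketch the first caller for a given path receives Load and is responsible for parsing the file; every later caller receives Ready holding a clone of the same Arc, so the parsed deletes are shared rather than re-read.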
   
   ## Are these changes tested?
   
   Added test_caching_delete_file_loader_caches_results to verify that repeated loads of the same delete file return the same shared in-memory objects.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

