[PR] feat: accelerate Iceberg RewriteDataFiles reads via Comet native scan [datafusion-comet]

via GitHub Wed, 06 May 2026 15:44:22 -0700


jordepic opened a new pull request, #4251:
URL: https://github.com/apache/datafusion-comet/pull/4251


   ## Which issue does this PR close?
   
   Closes #4250.
   
   ## Rationale for this change
   
   Rewriting data files through Spark procedures on Iceberg tables consumes a 
large share of query resources across the industry. Running the read side of 
these rewrites in native code where possible can speed them up significantly.
   
   ## What changes are included in this PR?
   
   Detect the Spark scans (`SparkStagedScan`) created during 
`RewriteDataFilesSparkAction` and replace them with Comet native scans. Extract 
their associated tasks and pass no filter to the native scan, since these scans 
carry none (see `SparkStagedScan`, line 50, in the Apache Iceberg project).
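
The replacement described above can be sketched roughly as follows. This is a self-contained illustration only, not the code in this PR: `PlanNode`, `BatchScan`, `NativeScan`, `Project`, and `StagedScanRule` are hypothetical stand-ins rather than real Spark, Iceberg, or Comet classes, and the actual rule may be structured quite differently. The sketch assumes a staged scan can be recognized by its class name and that its file tasks are already available on the node.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical plan-node types for illustration; not real Spark/Iceberg/Comet classes.
interface PlanNode {}

// A scan node that knows the name of its underlying scan class and its planned file tasks.
record BatchScan(String scanClassName, List<String> taskFiles) implements PlanNode {}

// The native replacement: same tasks, plus an optional filter.
record NativeScan(List<String> taskFiles, Optional<String> filter) implements PlanNode {}

// A simple unary parent operator, so the rule has something to recurse through.
record Project(PlanNode child) implements PlanNode {}

final class StagedScanRule {
    // Assumption: staged scans are identified by this simple class name.
    static final String STAGED_SCAN = "SparkStagedScan";

    private StagedScanRule() {}

    // Walks the plan and swaps every staged scan for a native scan.
    static PlanNode apply(PlanNode node) {
        if (node instanceof Project p) {
            return new Project(apply(p.child()));
        }
        if (node instanceof BatchScan s && s.scanClassName().endsWith(STAGED_SCAN)) {
            // Carry the tasks over unchanged; a staged scan has no filter to forward.
            return new NativeScan(s.taskFiles(), Optional.empty());
        }
        // Any other node is left as it is.
        return node;
    }
}
```

Scans of any other class fall through unchanged, so only plans produced by the rewrite action would be affected under these assumptions.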
   
   ## How are these changes tested?
   
   Two tests inspect the Spark plan produced when rewriting data files and 
verify that the scan operators are replaced. Before this change is merged, I 
can also run it locally and collect benchmarks for table compactions (on tables 
with only data files, and on tables with associated delete files).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


