ZENOTME opened a new issue, #1604:
URL: https://github.com/apache/iceberg-rust/issues/1604

   ### What feature are you trying to implement?
   
   As @colinmarc mentioned in 
https://apache-iceberg.slack.com/archives/C05HTENMJG4/p1753476472857519 , 
the iceberg-datafusion integration performs worse than pure DataFusion 
using ListingTable. As @liurenjie1024 pointed out, this is because the 
iceberg-datafusion integration currently uses only one thread. This issue 
proposes a design to support parallel scans in the iceberg-datafusion 
integration. Thanks to @colinmarc for 
https://github.com/colinmarc/iceberg-datafusion-benchmarks , which lets us 
dig into the bottleneck!
   
   ## Row group based parallel scan 
   
   This parallel scan is row group based. The basic idea is to prune the files 
that need to be scanned down to their remaining row groups, then pack those 
row groups into groups based on the parallelism set by DataFusion. The 
benefits of row group based parallelism:
   1. Scans can run in parallel even when there are only a few files.
   2. The read load is distributed more evenly after some row groups in a 
file have been pruned, e.g. 
https://github.com/apache/iceberg-rust/blob/d9fbc5c97e4d126a3850095beda725a8eb30229b/crates/iceberg/src/arrow/reader.rs#L241
   
   The process can be described as follows:
   - Steps 1-3 are what our 
[plan_file](https://github.com/apache/iceberg-rust/blob/d9fbc5c97e4d126a3850095beda725a8eb30229b/crates/iceberg/src/scan/mod.rs#L334)
 does today: prune the Iceberg metadata and finally produce the FileScanTasks. 
Each FileScanTask is one data file bound to the delete files related to it.
   - Step 4 prunes the row groups of each data file, which is what we do in: 
https://github.com/apache/iceberg-rust/blob/d9fbc5c97e4d126a3850095beda725a8eb30229b/crates/iceberg/src/arrow/reader.rs#L241
   - Step 5 organizes the row groups into several partition tasks based on 
the parallelism.
   
   Each partition task is returned as a RecordBatchStream when we execute the 
corresponding partition.
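
A minimal sketch of the packing step (step 5), assuming a greedy "largest first" strategy that balances partitions by compressed size. The `RowGroupTask` type, its fields, and `pack_row_groups` are hypothetical names for illustration only; they are not existing iceberg-rust APIs, and the proposal does not prescribe a particular packing algorithm.

```rust
/// Hypothetical unit of work: one row group of one data file that
/// survived pruning. Not an existing iceberg-rust type.
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupTask {
    pub data_file_path: String,
    pub row_group_index: usize,
    pub compressed_bytes: u64,
}

/// Packs row groups into `parallelism` partition tasks.
///
/// Assumption: a greedy heuristic is good enough. Row groups are sorted
/// from largest to smallest, and each one is assigned to the partition
/// with the smallest accumulated size so far.
pub fn pack_row_groups(
    mut tasks: Vec<RowGroupTask>,
    parallelism: usize,
) -> Vec<Vec<RowGroupTask>> {
    // Always produce at least one partition, even if parallelism is 0.
    let partition_count = parallelism.max(1);
    let mut partitions: Vec<Vec<RowGroupTask>> = vec![Vec::new(); partition_count];
    let mut loads = vec![0u64; partition_count];

    // Largest row groups first, so small ones can fill the gaps later.
    tasks.sort_by(|a, b| b.compressed_bytes.cmp(&a.compressed_bytes));

    for task in tasks {
        // Pick the least loaded partition (the first one on ties).
        let target = loads
            .iter()
            .enumerate()
            .min_by_key(|(_, load)| **load)
            .map(|(index, _)| index)
            .expect("partition_count is at least 1");
        loads[target] += task.compressed_bytes;
        partitions[target].push(task);
    }

    partitions
}
```

With this kind of packing, row groups from the same data file can end up in different partitions, so each partition task would presumably still need the delete files bound to that data file in order to apply them.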
   
   <img width="1404" height="507" alt="Image" 
src="https://github.com/user-attachments/assets/3a7c07ad-a244-41f0-b7cb-3b0018afc7b7"
 />
   
   
   ### Willingness to contribute
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
