snithish opened a new issue, #2220:
URL: https://github.com/apache/iceberg-rust/issues/2220

   ### Is your feature request related to a problem or challenge?
   
   As noted in [#1604](https://github.com/apache/iceberg-rust/issues/1604), 
Iceberg-DataFusion read performance is currently bottlenecked by 
single-threaded execution. While [size-based 
planning](https://github.com/apache/iceberg-rust/issues/128) is the proposed 
long-term solution, a more immediate improvement would be to parallelize over 
`FileScanTask`s and use `ArrowReaderBuilder` during plan execution. 
   
   ### Describe the solution you'd like
   
   Pre-calculate the `FileScanTask` streams and partition them across the 
available DataFusion partitions, updating the `IcebergTableScan` struct and 
its `ExecutionPlan` trait implementation:
   
   - **Pre-partitioning Scan Tasks:** `IcebergTableScan` accepts grouped 
tasks, `tasks: Vec<Vec<FileScanTask>>`, rather than computing streams eagerly.
   - **Propagating Partition Counts:** The `compute_properties` method 
returns `Partitioning::UnknownPartitioning(tasks.len())` instead of 
the hardcoded 1.
   - **Parallel Stream Execution:** The execute phase uses 
`self.tasks.get(partition)` to build an `ArrowReaderBuilder` reader only for 
the slice of tasks mapped to that DataFusion partition index.
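
A minimal sketch of the grouping and per-partition lookup described above, kept free of the `iceberg` and `datafusion` crates so it compiles on its own. The `FileScanTask` struct here is a simplified stand-in, and `group_tasks` and `tasks_for_partition` are hypothetical helper names, not existing APIs. Round-robin assignment is only one possible strategy and is an assumption of this sketch.

```rust
/// Simplified stand-in for iceberg's `FileScanTask` (assumption: only the
/// fields needed for this illustration are modelled).
#[derive(Debug, Clone, PartialEq)]
pub struct FileScanTask {
    pub data_file_path: String,
    pub length: u64,
}

/// Hypothetical helper: distribute tasks round-robin across at most
/// `target_partitions` groups. Empty groups are dropped, so `groups.len()`
/// is the partition count that would be reported to DataFusion.
pub fn group_tasks(
    tasks: Vec<FileScanTask>,
    target_partitions: usize,
) -> Vec<Vec<FileScanTask>> {
    // Guard against a zero partition target.
    let n = target_partitions.max(1);
    let mut groups: Vec<Vec<FileScanTask>> = vec![Vec::new(); n];
    for (i, task) in tasks.into_iter().enumerate() {
        groups[i % n].push(task);
    }
    groups.retain(|g| !g.is_empty());
    groups
}

/// Hypothetical helper mirroring `self.tasks.get(partition)`: return the
/// tasks for one partition index, or an empty slice if it is out of range.
pub fn tasks_for_partition(
    groups: &[Vec<FileScanTask>],
    partition: usize,
) -> &[FileScanTask] {
    groups.get(partition).map(Vec::as_slice).unwrap_or(&[])
}
```

In the proposed design, `groups.len()` would feed `Partitioning::UnknownPartitioning(...)`, and each `execute(partition, ...)` call would read only the slice returned for its index.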
   
   ### Willingness to contribute
   
   I would be willing to contribute to this feature with guidance from the 
Iceberg Rust community.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

