Cheappie opened a new issue, #4295:
URL: https://github.com/apache/arrow-datafusion/issues/4295

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Please correct me if I am wrong, but from what I understand, each partition 
(file_group) in FileScanConfig is processed sequentially, one file after 
another. That means that if the work is spread very unevenly (e.g. part A has 
10 files of 10MB, part B has 10 files of 10GB), the query takes as long as the 
largest partition needs to finish.
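
To put rough numbers on the example, here is a minimal sketch that compares the work done by the slowest partition with an evenly balanced split. It assumes each file has the stated size, that scan time is proportional to bytes read, and that 1GB = 1024MB; the function names are made up for illustration and are not DataFusion APIs.

```rust
/// Work (in MB) done by the slowest partition when every partition scans
/// only its own files, one after another.
fn static_makespan_mb(file_groups: &[Vec<u64>]) -> u64 {
    file_groups
        .iter()
        .map(|group| group.iter().sum::<u64>())
        .max()
        .unwrap_or(0)
}

/// Lower bound on the same quantity if the total work could be split
/// perfectly evenly across the same number of partitions.
fn balanced_lower_bound_mb(file_groups: &[Vec<u64>]) -> u64 {
    let total: u64 = file_groups.iter().flatten().sum();
    let n = file_groups.len().max(1) as u64;
    (total + n - 1) / n // ceiling division
}

fn main() {
    let part_a = vec![10u64; 10]; // 10 files of 10MB
    let part_b = vec![10 * 1024u64; 10]; // 10 files of 10GB
    let groups = vec![part_a, part_b];

    println!("static partitions: {} MB", static_makespan_mb(&groups));
    println!("balanced bound:    {} MB", balanced_lower_bound_mb(&groups));
}
```

Under these assumptions the static layout is bounded by part B (102400 MB), while an even split would need about half of that per partition (51250 MB).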
   
   **Describe the solution you'd like**
   I would like to implement work stealing, e.g. by sharing one emitter of 
PartitionedFile among several FileStreams, for example by having virtual 
partitions that all point to a single underlying partition.
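
A minimal sketch of the idea, using plain threads and only the standard library. `FileToScan` is a hypothetical stand-in for PartitionedFile, and the worker threads stand in for FileStreams; DataFusion's real execution is async, so this only illustrates the sharing pattern, not how it would be wired in.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for PartitionedFile: just a path and a size.
#[derive(Debug, Clone, PartialEq)]
struct FileToScan {
    path: String,
    size_mb: u64,
}

// The shared "emitter": one queue that every worker pulls from.
type SharedFiles = Arc<Mutex<VecDeque<FileToScan>>>;

fn next_file(queue: &SharedFiles) -> Option<FileToScan> {
    // The lock is held only while popping, not while scanning.
    queue.lock().unwrap().pop_front()
}

/// Runs `n_workers` threads that drain one shared queue and returns the
/// files each worker ended up handling.
fn scan_with_shared_queue(files: Vec<FileToScan>, n_workers: usize) -> Vec<Vec<FileToScan>> {
    let queue: SharedFiles = Arc::new(Mutex::new(VecDeque::from(files)));
    let handles: Vec<_> = (0..n_workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || {
                let mut handled = Vec::new();
                while let Some(file) = next_file(&queue) {
                    // A real scan would open and decode `file.path` here.
                    handled.push(file);
                }
                handled
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let mut files = Vec::new();
    for i in 0..10 {
        files.push(FileToScan { path: format!("a/{}.parquet", i), size_mb: 10 });
        files.push(FileToScan { path: format!("b/{}.parquet", i), size_mb: 10 * 1024 });
    }
    for (worker, handled) in scan_with_shared_queue(files, 2).iter().enumerate() {
        let mb: u64 = handled.iter().map(|f| f.size_mb).sum();
        println!("worker {} handled {} files ({} MB)", worker, handled.len(), mb);
    }
}
```

Because a worker only asks for the next file once it has finished the previous one, a worker stuck on a large file does not hold back the small files queued behind it. How the files are split between workers varies from run to run.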
   
   **Describe alternatives you've considered**
   * Migrate FileScanConfig from { file_groups: Vec<Vec<PartitionedFile>> } -> 
{ file_groups: Vec<Box<dyn Partition>> }. That way the interface stays close 
to what we have now, and I would be able to create n virtual partitions that 
internally point to a single partition.
   * Alternatively, migrate FileScanConfig from { file_groups: 
Vec<Vec<PartitionedFile>> } -> a queue/stream of files that can be shared among 
n workers.
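
The first alternative could look roughly like the sketch below. The issue names `dyn Partition` but does not define it, so the trait, its `next_file` method, and the `FileRef` stand-in for PartitionedFile are all assumptions made for illustration.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for PartitionedFile.
#[derive(Debug, Clone, PartialEq)]
struct FileRef(String);

// Hypothetical trait: anything that can hand out files to scan.
trait Partition: Send {
    /// Returns the next file to scan, or None when nothing is left.
    fn next_file(&mut self) -> Option<FileRef>;
}

/// Current behaviour: a partition owns a fixed list of files.
struct StaticPartition(VecDeque<FileRef>);

impl Partition for StaticPartition {
    fn next_file(&mut self) -> Option<FileRef> {
        self.0.pop_front()
    }
}

/// A virtual partition: a handle onto one queue shared with its siblings.
struct VirtualPartition(Arc<Mutex<VecDeque<FileRef>>>);

impl Partition for VirtualPartition {
    fn next_file(&mut self) -> Option<FileRef> {
        self.0.lock().unwrap().pop_front()
    }
}

/// Builds `n` virtual partitions that all draw from a single file list.
fn virtual_partitions(files: Vec<FileRef>, n: usize) -> Vec<Box<dyn Partition>> {
    let shared = Arc::new(Mutex::new(VecDeque::from(files)));
    (0..n)
        .map(|_| Box::new(VirtualPartition(Arc::clone(&shared))) as Box<dyn Partition>)
        .collect()
}

fn main() {
    let files: Vec<FileRef> = ["a", "b", "c"].iter().map(|s| FileRef(s.to_string())).collect();

    // Both kinds of partition fit behind the same `Box<dyn Partition>`.
    let mut groups: Vec<Box<dyn Partition>> = virtual_partitions(files.clone(), 2);
    groups.push(Box::new(StaticPartition(VecDeque::from(files))));

    for (i, group) in groups.iter_mut().enumerate() {
        println!("partition {} got {:?}", i, group.next_file());
    }
}
```

With this shape, the second alternative is the special case where every entry in `file_groups` is a handle onto the same shared queue.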

