Cheappie opened a new issue, #4295:
URL: https://github.com/apache/arrow-datafusion/issues/4295
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
Please correct me if I am wrong, but from what I understand, each partition
from FileScanConfig (a file_group) is executed sequentially. That means that if
the work is very unevenly distributed (e.g. part A 10 files 10MB, part B 10
files 10GB), the query will take as long as the largest partition needs to
finish.
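
To make the imbalance concrete, here is a minimal sketch of the arithmetic. It assumes the sizes in the example are per file (10MB each in part A, 10GB each in part B) and that scan time is roughly proportional to bytes read. `partition_costs` is a hypothetical helper, not a DataFusion API.

```rust
/// Total work (in MB) per partition when files are statically assigned.
/// Hypothetical helper for illustration only.
fn partition_costs(groups: &[Vec<u64>]) -> Vec<u64> {
    groups.iter().map(|group| group.iter().sum()).collect()
}

fn main() {
    // Assumption: sizes are per file, in MB.
    let part_a = vec![10u64; 10]; // 10 files x 10 MB
    let part_b = vec![10 * 1024u64; 10]; // 10 files x 10 GB

    let costs = partition_costs(&[part_a, part_b]);
    let slowest = *costs.iter().max().unwrap();
    let total: u64 = costs.iter().sum();
    let workers = costs.len() as u64;
    // Lower bound on the slowest worker if the work could be split evenly.
    let balanced = (total + workers - 1) / workers;

    println!("per-partition MB: {:?}", costs);
    println!("static assignment finishes after ~{} MB on one worker", slowest);
    println!("an even split would need ~{} MB per worker", balanced);
}
```

Under these assumptions the query time is bounded by part B alone, while the worker that owns part A sits idle for almost the whole scan.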
**Describe the solution you'd like**
I would like to implement work stealing, e.g. by sharing an emitter of
PartitionedFile among FileStreams, for example by having virtual partitions
that all point to a single underlying partition.
**Describe alternatives you've considered**
* Migrate FileScanConfig from { file_groups: Vec<Vec<PartitionedFile>> } ->
{ file_groups: Vec<Box<dyn Partition>> }. That way we keep the existing
interface pretty similar to what we have now, and I would be able to make n
virtual partitions that internally point to a single partition.
* Alternatively, migrate FileScanConfig from { file_groups:
Vec<Vec<PartitionedFile>> } -> a queue/stream of files that can be shared among
n workers.
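
The shared-queue alternative could look roughly like the sketch below. It uses plain threads and a `Mutex<VecDeque<_>>` purely for illustration. `FileToScan` and `scan_with_shared_queue` are hypothetical names, not DataFusion types, and the real FileStream is asynchronous, so an actual implementation would differ.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for `PartitionedFile`; only the size matters here.
struct FileToScan {
    size_mb: u64,
}

/// Drain one shared queue with `workers` threads.
/// Returns the number of MB each worker ended up handling.
fn scan_with_shared_queue(files: Vec<FileToScan>, workers: usize) -> Vec<u64> {
    let queue = Arc::new(Mutex::new(VecDeque::from(files)));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || {
                let mut handled_mb = 0u64;
                loop {
                    // The lock is held only while taking the next file.
                    let next = queue.lock().unwrap().pop_front();
                    match next {
                        // Real code would open and decode the file here.
                        Some(file) => handled_mb += file.size_mb,
                        // Queue is empty: this worker is done.
                        None => break,
                    }
                }
                handled_mb
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // Assumption: 10 files of 10 MB followed by 10 files of 10 GB.
    let files: Vec<FileToScan> = std::iter::repeat(10u64)
        .take(10)
        .chain(std::iter::repeat(10 * 1024u64).take(10))
        .map(|size_mb| FileToScan { size_mb })
        .collect();

    let per_worker = scan_with_shared_queue(files, 2);
    println!("MB handled per worker: {:?}", per_worker);
}
```

Because every worker pulls its next file only when it becomes idle, no worker is tied to a fixed file_group, which is the property both alternatives are after. In this toy version the per-file work is trivial, so the split between workers is not meaningful; only the queue-sharing pattern is.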