[GitHub] [arrow-datafusion] Cheappie commented on issue #4295: Change representation of partition in FileScanConfig

GitBox Thu, 24 Nov 2022 16:13:46 -0800


Cheappie commented on issue #4295:
URL: 
https://github.com/apache/arrow-datafusion/issues/4295#issuecomment-1326905014


   > I personally don't think we should have the concept of a partition at all, 
and should instead have a smarter work scheduler, but I haven't been able to 
work on that recently
   
   Yep, having partitions seems to be a limiting factor right now.
   
   There are two things on my plate right now:
   1. I would like to ensure that input data is well balanced among workers.
   2. I would like to implement a prefetcher.
   
   For both of these, replacing partitions with a single queue would help me. 
But I understand that it might not be a priority, or a good enough solution, 
for the project right now. In any case, the concept of a partition seems to 
sit pretty deep in the codebase: I saw that it is passed through the hierarchy 
of ExecutionPlan's `execute(...)` calls.
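
   To make the single-queue idea concrete, here is a minimal sketch using only 
the Rust standard library. It is not DataFusion code: `FileRange` and 
`drain_shared_queue` are hypothetical names, and the "scan" just counts bytes. 
The point is only that workers pull the next item when they are free, instead 
of each being assigned a fixed partition up front.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical unit of scan work: a byte range inside one file.
#[derive(Debug, Clone, PartialEq)]
pub struct FileRange {
    pub path: String,
    pub start: u64,
    pub end: u64,
}

/// Drains one shared queue with `workers` threads and returns the number of
/// bytes each worker claimed. A real scan would read the range where this
/// sketch only adds up its length.
pub fn drain_shared_queue(ranges: Vec<FileRange>, workers: usize) -> Vec<u64> {
    let queue = Arc::new(Mutex::new(VecDeque::from(ranges)));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || {
                let mut bytes = 0u64;
                loop {
                    // The lock guard is dropped at the end of this statement,
                    // so other workers are not blocked while we "process".
                    let next = queue.lock().unwrap().pop_front();
                    match next {
                        Some(range) => bytes += range.end - range.start,
                        None => break, // queue is empty, worker is done
                    }
                }
                bytes
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

   How the bytes split between workers depends on thread timing, but every range 
is claimed exactly once, so a worker that finishes early simply takes more items. 
A prefetcher could sit in front of the same queue and peek at upcoming items.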
   
   I wonder what kind of scheduler you have in mind?
   
   * Are operators going to be stateful or stateless?
   * Will the scheduler contain a DAG that would replace the hierarchy based on 
`children()` from ExecutionPlan?
   * How will morsel parallelism be implemented in DataFusion? I wonder how 
fair sharing of resources would be approached. From what I have heard, HyPer 
processes a single query at a time, which achieves ideal fairness with 
morsels. In concurrent systems, queries from various users won't create equal 
morsels: one user might select more columns in a projection, and different 
operators will have different costs. In my opinion it would be interesting to 
create morsels by splitting the dataset into fixed-size chunks (e.g. a 1 MB 
RecordBatch) instead of by number of tuples, as was done in the paper.
   * Rayon should deal pretty well with work stealing, but is it sufficient to 
tackle fair sharing of resources (e.g. CPU)? Do you plan to rely on the OS to 
time-slice the CPU, or to follow the approach taken in the morsel-driven 
parallelism paper of pinning cores and managing them?
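
   As a rough illustration of the fixed-size idea from the morsel question 
above, the sketch below cuts a row range into morsels by a byte budget rather 
than by a tuple count. It assumes a fixed row width in bytes, which real 
variable-length columns would not have; `morsels_by_bytes` is a hypothetical 
helper, not a DataFusion or Arrow API.

```rust
use std::ops::Range;

/// Splits `num_rows` rows into consecutive row ranges ("morsels") whose size
/// stays within `target_bytes`, assuming every row is `row_width_bytes` wide.
/// If a single row is wider than the budget, each morsel holds exactly one row.
pub fn morsels_by_bytes(
    num_rows: usize,
    row_width_bytes: usize,
    target_bytes: usize,
) -> Vec<Range<usize>> {
    // Guard against a zero width and always make progress by at least one row.
    let rows_per_morsel = (target_bytes / row_width_bytes.max(1)).max(1);

    (0..num_rows)
        .step_by(rows_per_morsel)
        .map(|start| start..(start + rows_per_morsel).min(num_rows))
        .collect()
}
```

   With a 1 MiB budget (1_048_576 bytes) and 300_000 rows, a narrow projection 
of 8 bytes per row yields 3 morsels, while a wide projection of 64 bytes per 
row yields 19. The wider query gets more, smaller-in-rows morsels, so each 
morsel represents a similar amount of data regardless of the projection.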


