alamb commented on pull request #972:
URL: https://github.com/apache/arrow-datafusion/pull/972#issuecomment-919325405


   @yjshen 
   
   > My core idea is to partition the dataset referred to in each query based 
on required size in bytes, 
   
   I think that is a reasonable idea too. The `batch_size` hit passed to `scan` 
may be a proxy for this (as it is supposed to control how large the individual 
batch sizes are)
   
   > We engine could adaptively change table processing speed by schedule more 
or fewer tasks to run concurrently, based on the current payload of the system 
or a more complex operator DAG scheduling policy.
   
   Yes, I am very much on board with this idea (related to 
https://github.com/apache/arrow-datafusion/issues/64 I think)
   
   
   All in all I think improving the abstractions around "partitions" in 
datafusion and decoupling the execution concurrency from the number of distinct 
data streams, would be very helpful and a good direction to head. I think this 
is aligned too with what @rdettai  is saying as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to