alamb commented on pull request #972: URL: https://github.com/apache/arrow-datafusion/pull/972#issuecomment-919325405
@yjshen > My core idea is to partition the dataset referred to in each query based on required size in bytes, I think that is a reasonable idea too. The `batch_size` hit passed to `scan` may be a proxy for this (as it is supposed to control how large the individual batch sizes are) > We engine could adaptively change table processing speed by schedule more or fewer tasks to run concurrently, based on the current payload of the system or a more complex operator DAG scheduling policy. Yes, I am very much on board with this idea (related to https://github.com/apache/arrow-datafusion/issues/64 I think) All in all I think improving the abstractions around "partitions" in datafusion and decoupling the execution concurrency from the number of distinct data streams, would be very helpful and a good direction to head. I think this is aligned too with what @rdettai is saying as well -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org