yjshen edited a comment on pull request #972:
URL: https://github.com/apache/arrow-datafusion/pull/972#issuecomment-919153521


   Sorry for not joining the discussion on the relationship among 
parallelism, number of partitions, target concurrency, target partition size, 
etc., at an earlier stage. *Since my first review, I had intended to keep the 
original term `max_concurrency` in DataFusion.* 
   
   Since this is turning into a more general design discussion, I want to 
share my understanding:
   
   My core idea is to partition the dataset referred to in each query based on 
a **desired size in bytes**, i.e., users set the desired partition size in 
bytes (or tune the partition size parameter step by step). In other words, 
users choose the [execution 
granularity](https://en.wikipedia.org/wiki/Granularity_(parallel_computing)) 
based on the amount of work performed by each task. This has several 
implications:
   - The engine could adaptively change table processing speed by scheduling 
more or fewer tasks to run concurrently, based on the current load of the 
system or a more complex operator DAG scheduling policy. 
   - Deciding the partition number should be the responsibility of the 
cost-based optimizer (CBO), not of `TableProvider`. I think the partition 
number for scanning each table should depend heavily on the actual data size 
involved in a query. It makes little sense to split a table into the same 
number of partitions for both `select 
count(distinct a) from table1` and `select a, b, c .... f from table1`.
   - Users could provision more cores when they need faster execution for a 
period of time, which I think is what is known as elasticity. Ballista is more 
likely to benefit from this.
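
To make the second point concrete, here is a minimal sketch of how the partition number could follow from the bytes a query actually reads. Everything in it is an assumption for illustration: `scanned_bytes`, `partition_count`, and the per-column byte statistics are hypothetical and not DataFusion APIs.

```rust
use std::collections::HashMap;

/// Sum the estimated bytes of only the columns a query reads.
/// `column_bytes` is a hypothetical per-column statistic; unknown columns count as 0.
fn scanned_bytes(column_bytes: &HashMap<&str, u64>, projection: &[&str]) -> u64 {
    projection
        .iter()
        .map(|c| column_bytes.get(c).copied().unwrap_or(0))
        .sum()
}

/// Number of partitions so that each holds at most `target_partition_size` bytes.
/// Returns at least 1 so that a tiny or empty scan still gets one task.
fn partition_count(bytes: u64, target_partition_size: u64) -> u64 {
    assert!(target_partition_size > 0, "target size must be positive");
    // Ceiling division without risking overflow on `bytes + target - 1`.
    let n = bytes / target_partition_size + u64::from(bytes % target_partition_size != 0);
    n.max(1)
}
```

With the same target size, a query that projects only column `a` would get far fewer partitions than one that projects `a` through `f`, because the scanned byte estimate differs.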
   
   Therefore, I'd like `SourceConfig` to contain `target_partition_size` and 
`target_partition_num`, both of which can be mutated or replaced by the CBO. 
We should always prefer `target_partition_size` when statistics are available, 
either stored in a catalog or cheap enough to compute. `target_partition_num` 
could be left to default to num_cpus, for example. When the input table is 
small enough, there is no need to worry about the partitioning strategy.
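
A rough sketch of that preference order, under stated assumptions: the struct below only borrows the names `SourceConfig`, `target_partition_size`, and `target_partition_num` from this comment. The field types, the `Default` impl, and `resolve_partitions` are hypothetical and do not describe existing DataFusion code.

```rust
use std::thread;

/// Hypothetical scan configuration; field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct SourceConfig {
    /// Desired bytes per partition; `None` means it was not set.
    target_partition_size: Option<u64>,
    /// Fallback partition count when size-based planning is not possible.
    target_partition_num: usize,
}

impl Default for SourceConfig {
    fn default() -> Self {
        // Default to the number of available cores, or 1 if that cannot be queried.
        let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
        SourceConfig { target_partition_size: None, target_partition_num: cores }
    }
}

/// Prefer size-based planning when both a target size and a byte estimate
/// exist; otherwise fall back to the configured partition count.
fn resolve_partitions(config: &SourceConfig, estimated_bytes: Option<u64>) -> usize {
    match (config.target_partition_size, estimated_bytes) {
        (Some(size), Some(bytes)) if size > 0 => {
            // Ceiling division; a small table collapses to a single partition.
            let n = bytes / size + u64::from(bytes % size != 0);
            n.max(1) as usize
        }
        _ => config.target_partition_num.max(1),
    }
}
```

An optimizer could overwrite either field before planning a scan. When no statistics are available, `estimated_bytes` is `None` and the count falls back to `target_partition_num`.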
   



