Dandandan commented on pull request #1500:
URL: 
https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1002310491


   Thanks for showing interest in this functionality!
   
   I believe this functionality shouldn't be part of `RepartitionExec`. 
The partitions (confusingly) mentioned there are a unit of parallelism for 
distributing data, not a way to group data as in the example of writing 
to different output partitions. The partitions in `RepartitionExec` are 
equivalent to Spark partitions and mirror that functionality / design.
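   To illustrate the distinction, here is a minimal sketch (the function name 
`hash_partition` is hypothetical, not a DataFusion API): rows are spread over 
N partitions by hashing a key, so one partition can hold many distinct keys. 
It is a unit of parallelism, not a grouping of equal values.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Distribute row indices across `n` partitions by hashing a key column.
/// Rows with equal keys land in the same partition, but a single partition
/// can still contain many distinct keys.
fn hash_partition(keys: &[&str], n: usize) -> Vec<Vec<usize>> {
    let mut parts = vec![Vec::new(); n];
    for (row, key) in keys.iter().enumerate() {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        parts[(h.finish() as usize) % n].push(row);
    }
    parts
}
```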
   
   We already have some pieces that we could use to implement the write support:
   * hash aggregation, i.e. `group by`. This could be used to group 
identical rows and append them to the output directories / files. 
`HashAggregateExec` has different modes, local and global. For writing, only 
the local mode is needed, as we can write rows to multiple files within the 
same directory (just like Spark does). 
   * `RepartitionExec::Hash`, which allows parallelizing the operations 
across more CPUs and, in Ballista, across nodes. This is already utilized by 
the hash aggregate code.
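   The local-mode grouping step above can be sketched roughly as follows 
(`group_by_partition_value` is a hypothetical helper, not existing DataFusion 
code): each writer task groups its own rows by the partition-column value, so 
each group can be appended to that value's output directory. Because the 
grouping is local, several tasks may each produce a file under the same 
directory, as Spark does.

```rust
use std::collections::HashMap;

/// Locally group row indices by their partition-column value. Each group
/// maps to one output directory; the calling task writes its groups as
/// separate files, so multiple tasks can contribute files to one directory.
fn group_by_partition_value(values: &[&str]) -> HashMap<String, Vec<usize>> {
    let mut groups: HashMap<String, Vec<usize>> = HashMap::new();
    for (row, v) in values.iter().enumerate() {
        groups.entry((*v).to_string()).or_default().push(row);
    }
    groups
}
```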
   
   The missing piece here is the support/wiring to *write* the final partitions 
to directories / files using a partitioning scheme.
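   For the partitioning scheme itself, one option is the Hive-style layout 
used by Spark, roughly `<root>/<col>=<value>/part-<task>-<seq>.parquet`. A 
minimal sketch of the path construction (the function and naming pattern are 
illustrative assumptions, not a settled design):

```rust
/// Build a Hive-style output path for a partition value, e.g.
/// "/data/out/date=2021-12-28/part-0003-0000.parquet". The task and
/// sequence numbers keep files from different writer tasks from colliding.
fn partition_path(root: &str, col: &str, value: &str, task: usize, seq: usize) -> String {
    format!("{}/{}={}/part-{:04}-{:04}.parquet", root, col, value, task, seq)
}
```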
   
   I don't have too much time at the moment to write up a proposal myself, 
but if you can see the direction I'm going in, I'd support writing a design 
for this to collect feedback on.
   
   It would be wise to look at some similar engines, like 
Spark/Hive/Trino/etc., for inspiration.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
