[GitHub] [arrow-datafusion] andrei-ionescu commented on pull request #1500: Add support for PartitionBy functionality

GitBox Wed, 29 Dec 2021 01:51:10 -0800


andrei-ionescu commented on pull request #1500:
URL: 
https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1002509571



   @Dandandan Thanks for the message. I didn't find any other way to read data 
and write it to storage partitioned. If you have such an example, please point 
me to it.
   
   My use case is this: read Parquet data (multiple files), repartition it, 
and write it out as OSS Delta. In my case I read the data into memory, collect 
it by partition, and then write each partition as a separate file.
   
   Can you give me an example of writing data as Parquet using the DataFrame 
API? Or is there a place where I could implement the Delta writer, or any other 
writer?
   
   I don't agree that this is useful only at write time, nor that going 
through grouping and aggregation is a good approach. My reasons are the 
following:
   
   - `group by` requires an aggregation function, which is not what is needed 
here. We don't aggregate anything; we just shuffle the data in a specific way.
   - I've seen that the shuffling process is implemented nicely using channels 
(for the other repartitioning modes), and I think the same approach should be 
used in this case too.
   - Parallel processing is still needed after partitioning the data in this 
way. Summarising metrics or running aggregations over data partitioned like 
this is easily parallelizable, which is a benefit.
   - Using aggregation instead of shuffling will incur a larger performance 
penalty, because it needs to apply the aggregation and keep some state.
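   
   To make the first two points concrete, here is a minimal sketch in plain 
Rust. It is not DataFusion code and not the code in this PR: the `Row` type and 
the `shuffle_by_key` function are hypothetical, and it uses only std threads 
and a std channel. It only illustrates that rows can be routed by key value 
through a channel without applying any aggregation function.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical row for illustration: (partition key, payload).
type Row = (String, i64);

/// Route every row to the output partition named by its key.
/// Rows are forwarded unchanged; nothing is aggregated.
fn shuffle_by_key(inputs: Vec<Vec<Row>>) -> BTreeMap<String, Vec<Row>> {
    let (tx, rx) = mpsc::channel::<Row>();

    // One producer thread per input partition, all sending into one channel.
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|part| {
            let tx = tx.clone();
            thread::spawn(move || {
                for row in part {
                    tx.send(row).expect("receiver dropped");
                }
            })
        })
        .collect();

    // Drop the original sender so the channel closes once producers finish.
    drop(tx);

    // Single consumer: the number of output partitions is not known up
    // front, so a new partition is created whenever a new key shows up.
    let mut out: BTreeMap<String, Vec<Row>> = BTreeMap::new();
    for row in rx {
        out.entry(row.0.clone()).or_default().push(row);
    }

    for handle in handles {
        handle.join().expect("producer thread panicked");
    }
    out
}
```

   The row order inside each output partition depends on thread scheduling, so 
a caller should not rely on it.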
   
   This partitioning brings other benefits as well. In terms of cons, I cannot 
think of many, and none is a blocker (though I have limited knowledge of 
DataFusion at this moment). The ones that come to mind are these:
   - In some cases the partitions will be more unbalanced, which is OK. 
Unbalanced partitions happen with `Partitioning::Hash` too, so I don't know if 
this is really a downside.
   - It is a bit slower than `Partitioning::Hash` due to locking.
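   
   As a rough illustration of the balance point, the sketch below counts 
partition sizes for the same keys under both schemes. It is plain Rust, not 
DataFusion code: both function names are hypothetical, and std's 
`DefaultHasher` only stands in for whatever hash function the engine uses. With 
a skewed key, all rows for that key land in one partition either way.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Sizes of `n` hash partitions: each key goes to bucket hash(key) % n.
fn hash_partition_sizes(keys: &[&str], n: usize) -> Vec<usize> {
    let mut sizes = vec![0; n];
    for key in keys {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        sizes[(hasher.finish() % n as u64) as usize] += 1;
    }
    sizes
}

/// Sizes of by-value partitions: one partition per distinct key.
fn value_partition_sizes(keys: &[&str]) -> BTreeMap<String, usize> {
    let mut sizes = BTreeMap::new();
    for key in keys {
        *sizes.entry(key.to_string()).or_insert(0) += 1;
    }
    sizes
}
```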
   
   If there are any other cons that I'm not aware of, please explain them to 
me. That will improve my knowledge of DataFusion.
   
   If the name I gave to this functionality is not good - let's say the term 
`PartitionBy` is associated with writing and is therefore misleading - then we 
can call it something else, like `ShuffleByExpression`.


