[GitHub] [arrow-datafusion] andrei-ionescu edited a comment on issue #1404: Hash partitioning not working properly

GitBox Tue, 14 Dec 2021 11:35:52 -0800


andrei-ionescu edited a comment on issue #1404:
URL: 
https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-993912895



   I have some more question on how the hash partitioning works right now in 
DataFusion.
   
   Given 3 unique values (`val1`, `val2`, `val3`) for the `my_col` column by 
which I want to repartition my data, distributed like this: 500 rows contain 
`val1`, 5 rows contain `val2` 10K rows contain `val3`.
   
   
   What will be the result of partitioning with `Partitioning::Hash(my_col, 3)`?
   
   Will it repartition in 3 partition, partition 1 containing 500 rows all 
containing only `val1`, partition 2 with 5 rows only `val2` and partition 3 
with 10K rows containing only `val3`?
   
   -- or --
   
   Will it repartition in 3 partitions all having similar sizes, approx 3500 
each (`10K + 500 + 5 / 3`)? In this case what will each partition contain? Is 
it like: partition 1 with 3500 rows all containing `val3`, partition 2 with 
3500 rows all with `val3` and partition 3 with 500 + 5 + 3000 rows containing 
`val1` + `val2` + `val3`?
   
   I guess it depends a lot on the hashing algorithm too due to collision 
probability. What's the hashing algorithm used?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow-datafusion] andrei-ionescu edited a comment on issue #1404: Hash partitioning not working properly

Reply via email to