andrei-ionescu edited a comment on issue #1404: URL: https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-993912895
I have some more question on how the hash partitioning works right now in DataFusion. Given 3 unique values (`val1`, `val2`, `val3`) for the `my_col` column by which I want to repartition my data, distributed like this: 500 rows contain `val1`, 5 rows contain `val2` 10K rows contain `val3`. What will be the result of partitioning with `Partitioning::Hash(my_col, 3)`? Will it repartition in 3 partition, partition 1 containing 500 rows all containing only `val1`, partition 2 with 5 rows only `val2` and partition 3 with 10K rows containing only `val3`? -- or -- Will it repartition in 3 partitions all having similar sizes, approx 3500 each (`10K + 500 + 5 / 3`)? In this case what will each partition contain? Is it like: partition 1 with 3500 rows all containing `val3`, partition 2 with 3500 rows all with `val3` and partition 3 with 500 + 5 + 3000 rows containing `val1` + `val2` + `val3`? I guess it depends a lot on the hashing algorithm too due to collision probability. What's the hashing algorithm used? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org