Dandandan edited a comment on issue #1404: URL: https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-993976204
> I have some more questions on how hash partitioning works right now in DataFusion.
>
> Given 3 unique values (`val1`, `val2`, `val3`) for the `my_col` column by which I want to repartition my data, distributed like this: 500 rows contain `val1`, 5 rows contain `val2`, and 10K rows contain `val3`.
>
> What will be the result of partitioning with `Partitioning::Hash(my_col, 3)`?
>
> Will it repartition into 3 partitions: partition 1 containing 500 rows, all `val1`; partition 2 with 5 rows, all `val2`; and partition 3 with 10K rows, all `val3`?
>
> -- or --
>
> Will it repartition into 3 partitions of similar size, approx 3500 each (`(10K + 500 + 5) / 3`)? In that case, what would each partition contain? Is it like: partition 1 with 3500 rows all containing `val3`, partition 2 with 3500 rows all containing `val3`, and partition 3 with 500 + 5 + 3000 rows containing `val1` + `val2` + `val3`?
>
> I guess it depends a lot on the hashing algorithm too, due to collision probability. What's the hashing algorithm used?

The answer is: the first is a possibility. The values are hashed, but then also divided over a number of partitions (by `hash(val) % 3`), so the "collision" probability is much higher than that of the hashing algorithm alone. Each distinct `val` is always assigned to the same partition, so the second option can't occur.

For example, if the hash values happen to be 1, 4, and 7 (or any `h` where `h % 3 = 1`) for the 3 different values, all of the rows are assigned to partition 1, and partitions 0 and 2 are empty. Given enough distinct values (say 1000), all partitions are expected to contain roughly an equal number of distinct values (e.g. roughly 100 each with 10 partitions). That's why it usually makes more sense to use it on a column with a higher number of distinct values.

The idea of the function is to guarantee that each unique value is assigned to the same partition, so you can split the further work into smaller, more parallelizable tasks.
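To make the `hash(val) % 3` idea concrete, here is a minimal sketch in Rust using the standard library's `DefaultHasher` (DataFusion uses its own internal hash function, so the concrete partition assignments below are illustrative only; `partition_of` is a hypothetical helper, not a DataFusion API):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assign a value to one of `n_partitions` buckets via hash(val) % n.
/// Deterministic: the same value always lands in the same partition.
fn partition_of<T: Hash>(val: &T, n_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    val.hash(&mut hasher);
    hasher.finish() % n_partitions
}

fn main() {
    let rows = ["val1", "val2", "val3", "val1", "val3"];
    for row in rows {
        // Every occurrence of the same value maps to the same partition,
        // but distinct values may still "collide" into one partition.
        println!("{row} -> partition {}", partition_of(&row, 3));
    }
}
```

With only 3 distinct values and 3 partitions, nothing prevents all three from hashing into the same bucket, which is exactly why the first scenario in the question (and worse skew) is possible.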
The partitioning can be done in parallel, in a streaming fashion, without knowing the distinct values beforehand, which is why it's a popular algorithm. See here for an explanation of the same method in Spark: https://luminousmen.com/post/spark-partitions

For distributed computations it makes sense to ensure that equal values, such as a customer ID, are located in the same partition, so you can do further calculations within that group, knowing that all rows for each customer end up together. Internally, in its query execution plans, DataFusion also uses this hash partitioning to parallelize the work for joins and aggregates. Ballista uses the same algorithm to distribute the workload.