Dandandan edited a comment on issue #1404:
URL: 
https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-993976204


   > I have some more question on how the hash partitioning works right now in 
DataFusion.
   > 
   > Given 3 unique values (`val1`, `val2`, `val3`) for the `my_col` column by 
which I want to repartition my data, distributed like this: 500 rows contain 
`val1`, 5 rows contain `val2` 10K rows contain `val3`.
   > 
   > 
   > What will be the result of partitioning with `Partitioning::Hash(my_col, 
3)`?
   > 
   > Will it repartition in 3 partition, partition 1 containing 500 rows all 
containing only `val1`, partition 2 with 5 rows only `val2` and partition 3 
with 10K rows containing only `val3`?
   > 
   > -- or --
   > 
   > Will it repartition in 3 partitions all having similar sizes, approx 3500 
each (`(10K + 500 + 5) / 3`)? In this case what will each partition contain? Is 
it like: partition 1 with 3500 rows all containing `val3`, partition 2 with 
3500 rows all with `val3` and partition 3 with 500 + 5 + 3000 rows containing 
`val1` + `val2` + `val3`?
   > 
   > I guess it depends a lot on the hashing algorithm too due to collision 
probability. What's the hashing algorithm used?
   
   The answer is: the first is a possibility. Rows are hashed, but the hash 
values are then divided over the number of partitions (by `hash(val) % 3`), so 
the "collision" probability is much higher than that of the hashing algorithm 
itself.
   Because each distinct `val` is always assigned to the same partition, the 
second option can't occur.
   
   For example, if the hash values happen to be 1, 4, and 7 (or any h where `h 
% 3 = 1`) for the 3 different values, all of the rows are assigned to partition 
1. Partition 0 and 2 are empty.
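
   The hash-then-modulo assignment can be sketched like this (a minimal 
illustration using Rust's standard `DefaultHasher` as a stand-in; this is not 
DataFusion's actual hash function, and `partition_for` is a hypothetical 
helper, not a DataFusion API):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash the key value, then take the remainder modulo the number of
// partitions to pick a partition index.
fn partition_for(val: &str, num_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    val.hash(&mut hasher);
    hasher.finish() % num_partitions
}

fn main() {
    // Every row carrying the same value lands in the same partition;
    // with only 3 distinct values, several (or all) of them may collide
    // on the same partition index.
    for val in ["val1", "val2", "val3"] {
        println!("{val} -> partition {}", partition_for(val, 3));
    }
}
```

   The key property is that the assignment is deterministic per value, not 
that it is balanced for a small number of distinct values.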
   
   Given enough distinct values (say 1000), each partition is expected to 
contain roughly an equal number of distinct values (e.g. roughly 100 with 10 
partitions).
   That's why it usually makes more sense to use hash partitioning on a column 
with a high number of distinct values.
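
   A quick way to see this balancing effect (again with `DefaultHasher` as a 
stand-in for the real hash function, and `partition_counts` as a hypothetical 
helper):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Count how many of `num_keys` distinct keys land in each of
// `num_partitions` partitions under hash-modulo assignment.
fn partition_counts(num_keys: u64, num_partitions: usize) -> Vec<u32> {
    let mut counts = vec![0u32; num_partitions];
    for i in 0..num_keys {
        let mut hasher = DefaultHasher::new();
        format!("key{i}").hash(&mut hasher);
        counts[(hasher.finish() % num_partitions as u64) as usize] += 1;
    }
    counts
}

fn main() {
    // With 1000 distinct keys and 10 partitions, each partition ends up
    // with roughly 100 keys (subject to random variation in the hash).
    let counts = partition_counts(1000, 10);
    println!("{counts:?}");
}
```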
   
   The idea of the function is to guarantee that each unique value is assigned 
to the same partition, so you can split the further work into smaller, more 
parallelizable tasks. The partitioning itself can be done in parallel, in a 
streaming fashion, without knowing the distinct values beforehand, which is 
why it's such a popular algorithm.
   
   See here for an explanation of the same method in Spark: 
https://luminousmen.com/post/spark-partitions
   For distributed computations it makes sense to ensure that equal values, 
such as a customer id, are located in the same partition, so you can do 
further calculations within that group, knowing that all rows for each 
customer are in the same partition.
   
   Internally, in the query execution plans, DataFusion also uses this hash 
partitioning to parallelize the work for joins and aggregates. Ballista also 
uses the same algorithm to distribute the workload.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
