andrei-ionescu commented on issue #1404: URL: https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-996191379
@Dandandan Let's continue with the example and what you just explained. Given the distribution above: 500 rows with partition value `v1`, 5 with `v2`, and 10K rows with `v3`. Now, `hash(v1)` will always produce the same value, let's say `9`, so all 500 rows are tagged with that `9` value. The same principle applies to `v2` and `v3`:

```
500 x hash(v1) => hash(v1) % 3 = 0 => 500 times 0 => 500 rows in partition 0
  5 x hash(v2) => hash(v2) % 3 = 1 =>   5 times 1 =>   5 rows in partition 1
10K x hash(v3) => hash(v3) % 3 = 2 => 10K times 2 => 10K rows in partition 2
```

That is what you explained above, right?

Given the logic above, when I call `collect_partitioned()` I expect to receive the desired partitions. The simplest representation would be a vector of size 3, each element containing a vector of `RecordBatch` objects. And that is indeed what is returned:

```
vec[0] = vec![rec_batch1, rec_batch2, ...]
vec[1] = vec![rec_batch_x1, rec_batch_x2, ...]
vec[2] = vec![rec_batch_y1, rec_batch_y2, ...]
```

Up to this point it works as expected. Now, here is what I claim is not working, or does NOT yield the expected results: when I count the rows in `rec_batch1`, add the rows in `rec_batch2`, and so on, summing the rows across all `RecordBatch`es in `vec[0]`, I expect to find either 500 rows, 5 rows, or 10K rows. That is what is not happening: in the example I give in the description, I could not match any of the collected counts to the original distribution. Shouldn't it work as I described above? Am I understanding it wrong?

Where I see a potential problem is the combination of the hashing function and the remainder operator. The hashing function is very important: if its output space is large, the collision probability should be small. But if `hash(v1) = 9` and `hash(v2) = 6`, then `% 3` maps both values to `0`, so both groups end up in the same partition, leaving 2 partitions with data and 1 empty.
So, what hashing function is applied to the values? Can you point me to that piece of code? Could we have a partitioning mechanism based on the hashing function alone (without the `%` operator)? Do you think that using large prime numbers for the number of partitions would produce a better distribution?