andrei-ionescu commented on issue #1404: URL: https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-996191379
@Dandandan Let's continue with the example and what you just explained. Given the distribution above: 500 rows with partition value `v1`, 5 with `v2`, and 10K rows with `v3`. Now, `hash(v1)` will always produce the same value, let's say `9`, so all 500 rows are tagged with that `9` value. The same principle applies to `v2` and `v3`:

```
500 x hash(v1) => hash(v1) % 3 = 0 => 500 times 0 => 500 rows in partition 0
  5 x hash(v2) => hash(v2) % 3 = 1 =>   5 times 1 =>   5 rows in partition 1
10K x hash(v3) => hash(v3) % 3 = 2 => 10K times 2 => 10K rows in partition 2
```

That is what you explained above, right?

Given the logic above, when I call `collect_partitioned()` I expect to receive the desired partitions. The simplest representation would be a vector of size 3, each element containing a vector of `RecordBatch` objects. And that is indeed what is returned:

```
vec[0] = vec![rec_batch1, rec_batch2, ...]
vec[1] = vec![rec_batch_x1, rec_batch_x2, ...]
vec[2] = vec![rec_batch_y1, rec_batch_y2, ...]
```

Up to this point it works as expected. Now, here is what I claim is not working, or does NOT yield the expected results: when I count the rows in `rec_batch1`, add the rows in `rec_batch2`, and so on, summing the rows across all `RecordBatch`es in `vec[0]`, I expect to find either 500 rows, 5 rows, or 10K rows. That is what is not happening: in the example I give in the description, I could not match any of the collected counts to the original distribution. Shouldn't it work as I described above? Am I understanding it wrong?

Where I see a potential problem is the combination of the hashing function and the remainder operator. The hashing function is very important: if its output space is large, the collision probability should be small. But if `hash(v1) = 9` and `hash(v2) = 6`, then `% 3` maps both values to `0`, so both groups end up in the same partition, leaving 2 partitions with data and 1 empty.
So, what hashing function is applied to the values? Can you point me to that piece of code? Could we have a partitioning mechanism based on the hashing function alone (without the `%` operator)? Do you think that using large prime numbers for the number of partitions would produce a better distribution?