advancedxy commented on issue #5626:
URL: https://github.com/apache/iceberg/issues/5626#issuecomment-1640501450

   > The example you give is a problem though regardless of the bucketing 
function if the number of buckets is ~ = the cardinality of the column (or 
group of columns). The other thing to note about this example is that we would 
probably have just as good a distribution of rows if we just bucket'd 
(col_b,16). If we really wanted to include col_a we would do an identity 
transform correct?
   
   I don't think I made it clear. Let me try to rephrase it in terms of 
probability.
   
   TL;DR: bucketing on each column separately multiplies the skew of the 
individual columns. Bucketing on multiple columns together does not; instead it 
reduces the skew across buckets, because the combined hash has more entropy and 
is more likely to be balanced.
   
   To keep the example simple, suppose:
   1. we have two columns, col_a and col_b, and we need to create 4 buckets via 
`bucket(col_a, col_b, 4)`
   2. the probability of `hash(col_a) mod 2 === 0` is `1/3`, and of 
`hash(col_a) mod 2 === 1` is `2/3`
   3. the probability of `hash(col_b) mod 2 === 0` is `1/4`, and of 
`hash(col_b) mod 2 === 1` is `3/4`
   
   Then, for the partition spec `bucket(col_a, 2) + bucket(col_b, 2)`, the 
probability of each bucket (which is effectively the expected share of rows in 
that bucket) is [1/12, 1/6, 1/4, 1/2], assuming the two hashes are independent. 
In the table below, c1 is `hash(col_a) mod 2` and c2 is `hash(col_b) mod 2`.

   | c2 \ c1 | c1 = 0 | c1 = 1 |
   |--------|--------|--------|
   | c2 = 0 | 1/3 * 1/4 = 1/12 | 2/3 * 1/4 = 1/6 |
   | c2 = 1 | 1/3 * 3/4 = 1/4  | 2/3 * 3/4 = 1/2 |
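
The table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not Iceberg code: the names `p_a`, `p_b` and `separate_buckets` are made up for illustration, and the two hashes are assumed to be independent.

```python
from fractions import Fraction as F
from itertools import product

# Assumed distributions of hash(col) mod 2, taken from the example above.
p_a = [F(1, 3), F(2, 3)]  # P(hash(col_a) mod 2 == 0), P(... == 1)
p_b = [F(1, 4), F(3, 4)]  # P(hash(col_b) mod 2 == 0), P(... == 1)


def separate_buckets(p_a, p_b):
    """Share of rows per partition for bucket(col_a, 2) + bucket(col_b, 2).

    Assumes the two hashes are independent, so each cell is a product.
    Cells are listed row by row, in the same order as the table.
    """
    return [pa * pb for pb, pa in product(p_b, p_a)]


separate = separate_buckets(p_a, p_b)
print([str(p) for p in separate])
```

The four shares always multiply out from the per-column skew, which is why the largest partition ends up with half of the rows here.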
   
   What about the probability of each bucket for `bucket(col_a, col_b, 4)`? We 
need to add another assumption:
   1. the probability of `hash(col_a) mod 4 === 0` is `1/6`, of 
`hash(col_a) mod 4 === 1` is `1/3`, of `hash(col_a) mod 4 === 2` is `1/6`, and 
of `hash(col_a) mod 4 === 3` is `1/3`. (One may argue that the distribution 
could be different, which is true, but this one is the most likely.)
   2. the probabilities for col_b are split likewise (that is, `1/8`, `3/8`, 
`1/8`, `3/8` for remainders 0 to 3).
   
   Then the probability of each bucket is [7/24, 5/24, 7/24, 5/24] (assuming 
my calculation is correct and I haven't missed anything).
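
One way to check these numbers is the sketch below. It rests on two assumptions that are not stated in the text: the multi-column bucket is modeled as `(hash(col_a) + hash(col_b)) mod 4`, which is a simplification and not Iceberg's actual hash function, and col_b's mod-4 distribution is taken to be [1/8, 3/8, 1/8, 3/8], splitting its mod-2 probabilities evenly the same way as for col_a. Under those assumptions the result matches the figures quoted above. The names `p_a4`, `p_b4` and `joint_buckets` are hypothetical.

```python
from fractions import Fraction as F

# Assumed distributions of hash(col) mod 4 (each mod-2 share split evenly).
p_a4 = [F(1, 6), F(1, 3), F(1, 6), F(1, 3)]
p_b4 = [F(1, 8), F(3, 8), F(1, 8), F(3, 8)]


def joint_buckets(p_a, p_b, n):
    """Share of rows per bucket when the bucket id is modeled as
    (hash_a + hash_b) mod n, with independent hashes (an assumption)."""
    out = [F(0)] * n
    for i, pa in enumerate(p_a):
        for j, pb in enumerate(p_b):
            out[(i + j) % n] += pa * pb
    return out


joint = joint_buckets(p_a4, p_b4, 4)
print([str(p) for p in joint])
```

Each bucket now sums several products of a large and a small probability, so the per-column skew partly cancels instead of compounding.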
   
![IMG_3608](https://github.com/apache/iceberg/assets/807537/f9358aaa-3f9e-4735-bc12-988488e03606)
   
   Comparing [1/12, 1/6, 1/4, 1/2] with [7/24, 5/24, 7/24, 5/24], I believe 
the second one is more balanced.
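
To put a number on "more balanced", one simple measure (chosen here for illustration, not something the thread prescribes) is the ratio between the largest and the smallest bucket. The helper name `skew_ratio` is hypothetical.

```python
from fractions import Fraction as F


def skew_ratio(probs):
    """Largest bucket share divided by the smallest; 1 means perfectly even."""
    return max(probs) / min(probs)


per_column = [F(1, 12), F(1, 6), F(1, 4), F(1, 2)]
multi_column = [F(7, 24), F(5, 24), F(7, 24), F(5, 24)]

ratio_per_column = skew_ratio(per_column)
ratio_multi_column = skew_ratio(multi_column)
print(ratio_per_column, ratio_multi_column)
```

With the two distributions above, the largest per-column partition is 6 times the smallest, while for the multi-column bucket the factor is 7/5.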


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

