advancedxy commented on issue #5626: URL: https://github.com/apache/iceberg/issues/5626#issuecomment-1640501450
> The example you give is a problem though regardless of the bucketing function if the number of buckets is ~= the cardinality of the column (or group of columns). The other thing to note about this example is that we would probably have just as good a distribution of rows if we just bucket'd (col_b,16). If we really wanted to include col_a we would do an identity transform correct?

I don't think I made it clear. Let me try to rephrase it in terms of probability.

TL;DR: bucketing on each column separately multiplies the skew of the individual columns. Bucketing on multiple columns together does not; instead it reduces the skew across buckets, because the combined hash has more entropy and is more likely to be balanced.

To keep the example simple, suppose:

1. we have two columns, col_a and col_b, and we need to create 4 buckets via `bucket(col_a, col_b, 4)`
2. the probability of `hash(col_a) mod 2 === 0` is `1/3`, and of `hash(col_a) mod 2 === 1` is `2/3`
3. the probability of `hash(col_b) mod 2 === 0` is `1/4`, and of `hash(col_b) mod 2 === 1` is `3/4`

Then, for the partition spec `bucket(col_a, 2) + bucket(col_b, 2)`, the probability of each bucket (effectively the expected share of rows in that bucket) is [1/12, 1/6, 1/4, 1/2]. See the calculation below, where c1 is `hash(col_a) mod 2` and c2 is `hash(col_b) mod 2`.

| c2 \ c1 | c1 = 0 | c1 = 1 |
|--------|--------|--------|
| c2 = 0 | 1/3 * 1/4 = 1/12 | 2/3 * 1/4 = 1/6 |
| c2 = 1 | 1/3 * 3/4 = 1/4 | 2/3 * 3/4 = 1/2 |

What about the probability of each bucket for `bucket(col_a, col_b, 4)`? We need to add another assumption:

1. the probability of `hash(col_a) mod 4 === 0` is `1/6`, of `hash(col_a) mod 4 === 1` is `1/3`, of `hash(col_a) mod 4 === 2` is `1/6`, and of `hash(col_a) mod 4 === 3` is `1/3`. (One may argue that the distribution could look different, and that's correct, but this distribution is the most likely one.)
2. the distribution of `hash(col_b) mod 4` follows likewise.

Then the probability of each bucket is [7/24, 5/24, 7/24, 5/24] (assuming my calculation is correct and I didn't miss anything).
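For anyone who wants to check the arithmetic, here is a minimal Python sketch of both calculations. It rests on assumptions that the comment does not spell out: the two hashes are treated as independent, "likewise" for col_b is read as `[1/8, 3/8, 1/8, 3/8]` for `hash(col_b) mod 4` (each mod 2 probability split evenly), and the multi-column bucket is modelled as `(hash(col_a) + hash(col_b)) mod 4`. The helper names are hypothetical, and this is not how Iceberg computes bucket values; it is only one way of combining the hashes that reproduces the numbers above.

```python
from fractions import Fraction as F
from itertools import product


def separate_buckets(p_a, p_b):
    """Cell probabilities for bucket(col_a, n) + bucket(col_b, m),
    assuming the two hashes are independent."""
    return [pa * pb for pa, pb in product(p_a, p_b)]


def combined_buckets(p_a, p_b, n):
    """Bucket probabilities for a single n-way bucket over both columns.
    Assumption: the combined hash behaves like (hash_a + hash_b) mod n."""
    out = [F(0)] * n
    for (i, pa), (j, pb) in product(enumerate(p_a), enumerate(p_b)):
        out[(i + j) % n] += pa * pb
    return out


# Distributions from the example (the mod 4 values for col_b are an assumed
# reading of "likewise": each mod 2 probability split evenly in two).
a_mod2 = [F(1, 3), F(2, 3)]
b_mod2 = [F(1, 4), F(3, 4)]
a_mod4 = [F(1, 6), F(1, 3), F(1, 6), F(1, 3)]
b_mod4 = [F(1, 8), F(3, 8), F(1, 8), F(3, 8)]

separate = sorted(separate_buckets(a_mod2, b_mod2))
combined = combined_buckets(a_mod4, b_mod4, 4)

print([str(p) for p in separate])  # ['1/12', '1/6', '1/4', '1/2']
print([str(p) for p in combined])  # ['7/24', '5/24', '7/24', '5/24']

# Ratio of the largest bucket to the smallest, as a rough skew measure.
print(max(separate) / min(separate))  # 6
print(max(combined) / min(combined))  # 7/5
```

Under these assumptions the largest bucket is 6 times the smallest in the per-column case, versus 7/5 in the combined case.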
Compare [1/12, 1/6, 1/4, 1/2] vs [7/24, 5/24, 7/24, 5/24], I believe the second one is more balanced.
