advancedxy commented on issue #5626: URL: https://github.com/apache/iceberg/issues/5626#issuecomment-1641368530
> I think the distribution math there is a bit off since it assumes that the skew is only present at the first bit of the hashing function. The assumption that increasing the number of buckets evenly divides the skew is a bit of an issue since this assumes the skew is generally present and correlated with the hashing function but only at the first bit.

The assumption is that skew is generally present, especially for multiple columns. I don't agree with the part `the skew is only present at the first bit of the hashing function`. When skew is present, it would be related to the hash result in some way.

> That said I did run some experiments on my own and while there wasn't a ton of difference between composition and running the function on its own, there was a benefit to using the combined function. I'll try to finish up my test framework for running more examples.

This is a good way to demonstrate the distribution. I also wrote a quick program to demonstrate my idea; see https://gist.github.com/advancedxy/236a8db8de03cf40c2ecbfebd4bf07ef for details. It may still be a simplified version, but I think it matches my previous calculation.

> If we really wanted to include col_a we would do an identity transform correct?

For this part: when we are dealing with a primary-key table whose primary key is made of multiple columns, I think it's important to include all of the key columns in the bucketing. An identity transform might not be sufficient.

> Anyway we probably do need to support multi-arg transforms within Iceberg at some point, so this may be a good time to start a design document and work towards adding that to spec as a first step.

+1. I agree with adding that to the spec as a first step.
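As a rough illustration of the bucketing discussion above (this is not the program from the linked gist), the sketch below compares two layouts on synthetic data: composing `bucket(4, col_a)` with `bucket(4, col_b)`, versus a single hash over both columns with 16 buckets. Everything here is an assumption made for illustration: the helper names (`stable_hash`, `bucket`, `make_rows`, `imbalance`), the skew model (half of the rows share one `col_a` value), and the use of MD5 as a portable stand-in for the 32-bit Murmur3 hash that the Iceberg spec defines for its bucket transform.

```python
import hashlib
import random
from collections import Counter


def stable_hash(*values):
    """Deterministic stand-in hash over one or more column values.

    Assumption: MD5 is used only for portability; it is not Iceberg's hash.
    """
    data = "\x1f".join(str(v) for v in values).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def bucket(n, *values):
    """Hypothetical multi-arg bucket: hash all values together, then mod n."""
    return stable_hash(*values) % n


def make_rows(num_rows=50_000, hot_fraction=0.5, seed=42):
    """Synthetic (col_a, col_b) rows where one col_a value dominates."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_rows):
        # Skewed column: a single hot value covers hot_fraction of the rows.
        a = 0 if rng.random() < hot_fraction else rng.randrange(1, 1000)
        # High-cardinality column with no skew.
        b = rng.randrange(1_000_000)
        rows.append((a, b))
    return rows


def imbalance(assignments, total_buckets):
    """Largest bucket size divided by the ideal (perfectly even) size."""
    counts = Counter(assignments)
    ideal = len(assignments) / total_buckets
    return max(counts.values()) / ideal


rows = make_rows()
n_a, n_b = 4, 4

# Composition: each column is bucketed independently, giving n_a * n_b cells.
composed = [(bucket(n_a, a), bucket(n_b, b)) for a, b in rows]
# Combined: both columns feed one hash, with the same total bucket count.
combined = [bucket(n_a * n_b, a, b) for a, b in rows]

composed_ratio = imbalance(composed, n_a * n_b)
combined_ratio = imbalance(combined, n_a * n_b)

print(f"composed max/ideal: {composed_ratio:.2f}")
print(f"combined max/ideal: {combined_ratio:.2f}")
```

Under these assumptions the composed layout should show a noticeably higher max/ideal ratio, because the hot `col_a` value always lands in the same `col_a` bucket, while the combined hash spreads those rows using `col_b`. Different skew models (for example, skew correlated across both columns) could give different results, so this is only one data point and not a general claim.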
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
