RussellSpitzer commented on issue #5626: URL: https://github.com/apache/iceberg/issues/5626#issuecomment-1642246921
> The assumption is that skew is generally present especially for multiple cols. I don't agree with the part the skew is only present at the first bit of the hashing function. When the skew is present, it would somehow related to the hash result.

This was in reference to

> the possibility of hash(col_a) mod 4 === 0 is 1/6, hash(col_a) mod 4 === 1 is 1/3, hash(col_a) mod 4 === 2 is 1/6, hash(col_a) mod 4 === 3 is 1/3. (one may argue that the distribution could be in other ways, yeah, that's correct. But this distribution is most likely).

I think we have very different models for skew. In my mind, skew is an artifact of a small number of values that are extremely over-represented in a data set, not a systematic alignment with the hashing function. In my model, increasing the number of buckets does not evenly divide their contents. Skew is apparent in our buckets because each time we bucket, every bucket receives a random assortment of partitions, including some random number of the "skewed" partitions.

As a trivial example, suppose I have a data set consisting of the integers 1, 2, 3, 4, where 3 accounts for 70% of the data set and the remaining digits cover 10% each. With two buckets I would see

(1, 3 - 80%) (2, 4 - 20%)

But if I divide my buckets I see

(10%, 70%, 10%, 10%)

Because skew is modeled like this in my model, there is no correlation with the hashing function (at least not in a broad way), and whenever we change our bucketing it is really just changing what random number of the "skewed" partitions end up in a specific bucket.
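The trivial example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `bucket_shares` and `WEIGHTS` are hypothetical names, and `value % num_buckets` stands in for the real bucket transform (which hashes the value first), so the bucket order will not match an actual table.

```python
from collections import Counter

# Percent of rows carrying each value, taken from the example above:
# 3 is heavily over-represented, the other values cover 10% each.
WEIGHTS = {1: 10, 2: 10, 3: 70, 4: 10}


def bucket_shares(weights, num_buckets):
    """Return the percent of rows that land in each bucket.

    Assumption: `value % num_buckets` is a stand-in for the real
    hash-based bucket transform; it is used only to keep the sketch
    deterministic.
    """
    shares = Counter()
    for value, percent in weights.items():
        shares[value % num_buckets] += percent
    return [shares[b] for b in range(num_buckets)]


print(bucket_shares(WEIGHTS, 2))  # [20, 80]
print(bucket_shares(WEIGHTS, 4))  # [10, 10, 10, 70]
```

Doubling the bucket count does not split the 80% bucket into two 40% buckets. The heavy value stays whole and simply lands in whichever bucket it happens to map to, which is the sense in which this kind of skew is not correlated with the hashing function.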
