RussellSpitzer commented on issue #5626:
URL: https://github.com/apache/iceberg/issues/5626#issuecomment-1642246921

   > The assumption is that skew is generally present especially for multiple 
cols. I don't agree with the part the skew is only present at the first bit of 
the hashing function. When the skew is present, it would somehow related to the 
hash result.
   
   This was in reference to 
   
   > the possibility of hash(col_a) mod 4 === 0 is 1/6, hash(col_a) mod 4 === 1 
is 1/3, hash(col_a) mod 4 === 2 is 1/6, hash(col_a) mod 4 === 3 is 1/3. (one 
may argue that the distribution could be in other ways, yeah, that's correct. 
But this distribution is most likely).
   
   I think we have very different models for skew. In my mind, skew is an 
artifact of a small number of values that are extremely over-represented in a 
data set, not a systematic alignment with the hashing function. In my model, 
increasing the number of buckets does not evenly divide their contents. Skew is 
apparent in our buckets because each time we bucket, each bucket gets some 
random assortment of partitions, including some number of the "skewed" 
partitions. As a trivial example, suppose I have a data set consisting of the 
integers 1, 2, 3, 4, where 3 accounts for 70% of the data set and each of the 
remaining integers accounts for 10%. With two buckets I would see
   (1, 3: 80%) (2, 4: 20%)
   
   But if I divide my buckets again, into four, I see
   
   (10%, 70%, 10%, 10%)
   
   Because skew is modeled like this in my model, there is no correlation with 
the hashing function (at least not in a broad way), and whenever we change our 
bucketing we are really just changing what random number of the "skewed" 
partitions end up in a specific bucket. 
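
   The example above can be reproduced with a small sketch. It is only an 
illustration of this skew model: `VALUE_SHARE` and `bucket_shares` are 
hypothetical names, and `value % num_buckets` is a stand-in for the hash, 
assumed here for simplicity; it is not Iceberg's actual bucket transform.

```python
from collections import defaultdict

# Percentage of rows carrying each value; 3 is the over-represented value.
VALUE_SHARE = {1: 10, 2: 10, 3: 70, 4: 10}


def bucket_shares(value_share, num_buckets):
    """Return the percentage of rows that land in each bucket."""
    shares = defaultdict(int)
    for value, pct in value_share.items():
        # Stand-in hash (assumption): the value itself, modulo the bucket count.
        shares[value % num_buckets] += pct
    return [shares[b] for b in range(num_buckets)]


two_buckets = bucket_shares(VALUE_SHARE, 2)
four_buckets = bucket_shares(VALUE_SHARE, 4)

print(two_buckets)   # [20, 80]
print(four_buckets)  # [10, 10, 10, 70]
```

   Going from two buckets to four does not split the 80% bucket evenly. The 
heavy value moves as a unit, so one bucket still holds 70% no matter how the 
light values are redistributed.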


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

