advancedxy commented on issue #5626:
URL: https://github.com/apache/iceberg/issues/5626#issuecomment-1641368530

   > I think the distribution math there is a bit off since it assumes that the 
skew is only present at the first bit of the hashing function. The assumption 
that increasing the number of buckets evenly divides the skew is a bit of an 
issue since this assumes the skew is generally present and correlated with the 
hashing function but only at the first bit. 
   
   The assumption is that skew is generally present, especially across multiple 
columns. I don't agree with the part `the skew is only present at the first bit of 
the hashing function`. When skew is present, it would somehow be related to 
the hash result.
   
   > That said I did run some experiments on my own and while there wasn't a 
ton of difference between composition and running the function on its own, 
there was a benefit to using the combined function. I'll try to finish up my 
test framework for running more examples.
   
   This is a good way to demonstrate the distribution. I also wrote a quick 
program to demonstrate my idea; see 
https://gist.github.com/advancedxy/236a8db8de03cf40c2ecbfebd4bf07ef for 
details. It may still be a simplified model, but I think it matches my 
previous calculation.
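
   To make the comparison concrete, here is a minimal sketch. It is not the 
code from the gist. It assumes a simple skew model in which one hot `col_a` 
value accounts for half of the rows, and it uses a SHA-256 based stand-in hash 
rather than Iceberg's Murmur3-based bucket transform. The names `bucket`, 
`max_share`, and `simulate` are hypothetical helpers for illustration only. 
It compares composing two 4-way buckets against one 16-way bucket over both 
columns, and reports the share of rows in the fullest partition.

```python
import hashlib
import random
from collections import Counter


def bucket(value, n):
    """Stand-in hash bucket. NOT Iceberg's Murmur3-based bucket transform."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def max_share(partitions):
    """Fraction of all rows that land in the fullest partition."""
    counts = Counter(partitions)
    return max(counts.values()) / len(partitions)


def simulate(rows=50_000, hot_fraction=0.5, n=4, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(rows):
        # Assumed skew model: a single hot col_a value (0) takes
        # hot_fraction of the rows; col_b is roughly uniform.
        a = 0 if rng.random() < hot_fraction else rng.randrange(1, 10_000)
        b = rng.randrange(1_000_000)
        data.append((a, b))

    # Composition: bucket each column separately, n * n partitions in total.
    composed = [(bucket(a, n), bucket(b, n)) for a, b in data]
    # Combined: hash both columns together into the same number of partitions.
    combined = [bucket((a, b), n * n) for a, b in data]
    return max_share(composed), max_share(combined)


composed_max, combined_max = simulate()
print(f"composed max share: {composed_max:.3f}")
print(f"combined max share: {combined_max:.3f}")
```

   Under this assumed model the hot value pins half of the rows to a single 
`col_a` bucket, so the composed layout can only spread them over 4 of the 16 
partitions, while the combined hash can use all 16. Other skew models may 
behave differently.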
   
   > If we really wanted to include col_a we would do an identity transform 
correct?
   
   For this part: when we are dealing with a primary-key table whose primary 
key is made up of multiple columns, I think it's important to include all of 
the key columns in the bucketing. An identity transform might not be sufficient.
   
   > Anyway we probably do need to support multi-arg transforms within Iceberg 
at some point, so this may be a good time to start a design document and work 
towards adding that to the spec as a first step.
   
   +1. I agree with adding that to the spec as a first step.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

