[GitHub] [iceberg] RussellSpitzer commented on issue #5626: Support bucket transform on multiple data columns

via GitHub Tue, 18 Jul 2023 05:29:40 -0700


RussellSpitzer commented on issue #5626:
URL: https://github.com/apache/iceberg/issues/5626#issuecomment-1640118093


   > > @vamen, you can just add multiple bucket partitions to bucket by more 
than one column. We chose to do this rather than hash the fields together so 
they can be used together or independently.
   > 
   > Hi @rdblue, I don't think multiple bucket partitions can express the same 
semantics as a bucket transform on multiple columns.
   > 
   > For example, `bucket(16, id_col, sec_col)` means that we need to bucket 
`(id_col, sec_col)` into 16 buckets. If we replace it with 
`bucket(16, id_col) + bucket(16, sec_col)`, it would create 16 * 16 = 256 
buckets. The most similar bucket spec would be `bucket(4, id_col) + bucket(4, 
sec_col)`, which creates 4 * 4 = 16 buckets. However, that assumes the 
cardinalities of the bucket columns are balanced. It could also be 
`bucket(2, id_col) + bucket(8, sec_col)`. Did I miss anything?
   > 
   > Therefore, I believe it's still valuable to support bucketing on multiple 
columns, especially when the primary key is made up of multiple columns. WDYT?
   
   Other than being able to use a prime number of buckets, I'm not sure I see 
the difference between combining 
   Bucket(column_a, x) with Bucket(column_b, y), and using Bucket(column_a, 
column_b, x*y). In the example above, I don't think the 2, 8 or 4, 4 splits 
would actually be distributed differently for normally distributed data. For 
skewed data, both approaches would most likely still have an issue, since a 
single bucket would receive the skew.
   
   Could you be a bit more clear?
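
To make the comparison concrete, here is a minimal sketch that counts rows per partition for the two-field layouts mentioned above (4, 4 and 2, 8) and for a single 16-bucket transform over both columns. Everything in it is an assumption for illustration: `stand_in_hash` is a hypothetical SHA-256-based hash and not the hash Iceberg uses for bucketing, the helper names are invented, and the input is synthetic, non-skewed data.

```python
import hashlib
from collections import Counter
from itertools import product


def stand_in_hash(value) -> int:
    """Deterministic stand-in hash. Assumption: this is NOT Iceberg's bucket hash."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def bucket(value, n: int) -> int:
    """Map a value to one of n buckets using the stand-in hash."""
    return stand_in_hash(value) % n


def two_field_layout(rows, n_a: int, n_b: int) -> Counter:
    """Rows per partition when each column gets its own bucket field."""
    return Counter((bucket(a, n_a), bucket(b, n_b)) for a, b in rows)


def combined_layout(rows, n: int) -> Counter:
    """Rows per partition when both columns are hashed together (hypothetical)."""
    return Counter(bucket((a, b), n) for a, b in rows)


# Synthetic, non-skewed input: 200 x 200 distinct (column_a, column_b) pairs.
rows = list(product(range(200), range(200)))

layouts = {
    "bucket(a, 4) + bucket(b, 4)": two_field_layout(rows, 4, 4),
    "bucket(a, 2) + bucket(b, 8)": two_field_layout(rows, 2, 8),
    "bucket(a, b, 16)": combined_layout(rows, 16),
}

for name, layout in layouts.items():
    sizes = layout.values()
    # Report partition count and the smallest and largest partition sizes.
    print(f"{name}: partitions={len(layout)} min={min(sizes)} max={max(sizes)}")
```

Each layout yields at most 16 partitions, so the printed minimum and maximum sizes give a rough sense of how evenly rows spread on this input. The sketch says nothing about skewed data or about a real table's file layout.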
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


