rdblue edited a comment on issue #430: Support bucket table for Iceberg
URL: 
https://github.com/apache/incubator-iceberg/issues/430#issuecomment-533360026
 
 
   Thanks @jerryshao! I had a look at the doc and made some comments.
   
   The main thing is that Iceberg already supports bucketing and has solved 
many of the challenges you identified, like schema evolution. There are two 
remaining problems:
   1. Writing requires users to [cluster data into buckets using a 
UDF](https://github.com/apache/incubator-iceberg/issues/274).
   2. Bucketed joins can't take advantage of Iceberg bucket values.
   
   For problem 1, we need to allow Iceberg to control the 
`requiredChildDistribution` and `requiredChildOrdering` returned by 
[`WriteToDataSourceV2Exec`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L318).
 Here's [a gist showing how we automatically insert the distribution and 
ordering requirements](https://gist.github.com/rdblue/468cff86ffcdd07dcea55520ab9c267c) 
that make bucketed writes automatic. That approach also depends on #317.
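
To make the idea of a required distribution and ordering concrete, here is a minimal Python sketch of what clustering rows into buckets before a write amounts to. It is only an illustration, not Spark or Iceberg code: `bucket` and `cluster_for_write` are hypothetical helpers, and CRC32 is a stand-in hash chosen because it ships with Python. Iceberg's own bucket transform is defined in its table spec and uses a different hash, so the bucket numbers below will not match Iceberg's.

```python
import zlib
from collections import defaultdict


def bucket(value, num_buckets):
    """Map a value to a bucket number in [0, num_buckets).

    ASSUMPTION: CRC32 is a stand-in hash for illustration only;
    it is not the hash that Iceberg's bucket transform uses.
    """
    h = zlib.crc32(str(value).encode("utf-8"))
    return (h & 0x7FFFFFFF) % num_buckets


def cluster_for_write(rows, key, num_buckets):
    """Group rows by bucket and sort each group by the key.

    Grouping plays the role of the required distribution (all rows of a
    bucket end up together); sorting plays the role of the required ordering.
    """
    clustered = defaultdict(list)
    for row in rows:
        clustered[bucket(row[key], num_buckets)].append(row)
    return {b: sorted(rs, key=lambda r: r[key]) for b, rs in sorted(clustered.items())}


rows = [{"id": i, "name": f"user-{i}"} for i in range(10)]
files = cluster_for_write(rows, "id", 4)

for b, rs in files.items():
    print(b, [r["id"] for r in rs])
```

Each entry of `files` corresponds to the rows a single writer task would receive, so a task only ever writes to one bucket at a time. Today the user has to do this clustering by hand with a UDF; the goal is for the table to request it automatically.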
   
   We also need a [FunctionCatalog](https://github.com/apache/spark/pull/24559) 
that allows us to return Iceberg transforms as UDFs that Spark can use.
   
   For problem 2, we plan to add support in Spark for using bucket values to 
speed up joins. We aren't sure of the exact design yet, but we know Spark will 
need to recognize that a table is bucketed (using the Table's partitioning), 
get the bucket function from the table's catalog (using FunctionCatalog), and 
apply that function to the other side of the join so its rows are clustered 
the same way. If the other side of the join already uses the same partition 
function, then we can avoid a shuffle for that side of the join as well.
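
The join idea can be sketched the same way. The Python below is a toy model under stated assumptions, not a description of how Spark would implement it: `bucket`, `bucketize`, and `bucketed_join` are hypothetical names, and CRC32 again stands in for the table's real bucket function. One side (`orders`) is treated as already bucketed on disk, and the other side (`users`) is prepared with the same function so that matching keys always land in the same bucket.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4


def bucket(value, num_buckets=NUM_BUCKETS):
    """ASSUMPTION: CRC32 stand-in for the table's bucket function."""
    h = zlib.crc32(str(value).encode("utf-8"))
    return (h & 0x7FFFFFFF) % num_buckets


def bucketize(rows, key):
    """Cluster rows by bucket; this is the shuffle for a non-bucketed side."""
    out = defaultdict(list)
    for row in rows:
        out[bucket(row[key])].append(row)
    return out


def bucketed_join(left_buckets, right_buckets, key):
    """Join bucket by bucket; rows never need to cross bucket boundaries."""
    joined = []
    for b, left_rows in left_buckets.items():
        # Build a small hash index over the matching bucket of the right side.
        index = defaultdict(list)
        for r in right_buckets.get(b, []):
            index[r[key]].append(r)
        for left_row in left_rows:
            for right_row in index.get(left_row[key], []):
                joined.append({**left_row, **right_row})
    return joined


# Pretend `orders` is stored bucketed by user_id, so no shuffle is needed for it.
orders = [{"user_id": i % 5, "order_id": i} for i in range(12)]
orders_buckets = bucketize(orders, "user_id")

# `users` is not bucketed, so it is prepared with the table's bucket function.
users = [{"user_id": i, "name": f"user-{i}"} for i in range(4)]
users_buckets = bucketize(users, "user_id")

result = bucketed_join(orders_buckets, users_buckets, "user_id")
print(len(result))
```

If `users` were also stored with the same bucket function and bucket count, the second `bucketize` call would be unnecessary as well, which is the case where neither side of the join needs a shuffle.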
   
   Hopefully this short write-up and the comments I left on the doc give you an 
idea of the current status of bucketed joins. Thanks for working on this!

