rdblue edited a comment on issue #430: Support bucket table for Iceberg
URL: https://github.com/apache/incubator-iceberg/issues/430#issuecomment-533360026

Thanks @jerryshao! I had a look at the doc and made some comments.

The main thing is that Iceberg already supports bucketing and has solved many of the challenges you identified, like schema evolution. There are two remaining problems:

1. Writing requires users to [cluster data into buckets using a UDF](https://github.com/apache/incubator-iceberg/issues/274).
2. Bucketed joins can't take advantage of Iceberg bucket values.

For problem 1, we need to allow Iceberg to control the `requiredChildDistribution` and `requiredChildOrdering` returned by [`WriteToDataSourceV2Exec`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L318). Here's [a gist that shows what we use to automatically insert distribution/ordering requirements](https://gist.github.com/rdblue/468cff86ffcdd07dcea55520ab9c267c) that allow automatic bucketing writes. That also depends on #317. We also need a [FunctionCatalog](https://github.com/apache/spark/pull/24559) that allows us to return Iceberg transforms as UDFs that Spark can use.

For problem 2, we are planning to add support for Spark to be able to use bucket values to speed up joins. We aren't quite sure how to do this yet, but we know that Spark will need to recognize that a table is bucketed (using the Table's partitioning), get the bucket function from the table's catalog (using FunctionCatalog) and use that function to prepare data for the other side of the join. If the other side of the join uses the same partition function, then we can avoid a shuffle for that side of the join as well.

Hopefully this short write-up and the comments I left on the doc give you an idea of the current status of bucketed joins. Thanks for working on this!
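
To illustrate the idea behind problem 2, here is a minimal, engine-independent sketch of why two inputs clustered by the same bucket function can be joined bucket by bucket without redistributing rows. Everything in it is an assumption made for illustration: the `bucket` function is a CRC32 stand-in rather than Iceberg's actual bucket transform, and `cluster`, `bucketed_join`, `NUM_BUCKETS`, and the sample rows are hypothetical names and data, not Spark or Iceberg APIs.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for this sketch


def bucket(key, num_buckets=NUM_BUCKETS):
    """Stand-in bucket function (CRC32 mod N); NOT Iceberg's real transform."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def cluster(rows, num_buckets=NUM_BUCKETS):
    """Group (key, value) rows by bucket id, mimicking a bucketed layout."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket(row[0], num_buckets)].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets):
    """Inner join on the key, one bucket at a time.

    Both sides used the same bucket function, so equal keys always share a
    bucket id and no rows need to move between buckets.
    """
    out = []
    for bucket_id, left_rows in left_buckets.items():
        # Index the matching right-hand bucket by join key.
        right_index = defaultdict(list)
        for right_row in right_buckets.get(bucket_id, []):
            right_index[right_row[0]].append(right_row)
        for left_row in left_rows:
            for right_row in right_index.get(left_row[0], []):
                out.append((left_row[0], left_row[1], right_row[1]))
    return out


# Hypothetical sample data: (user, amount) and (user, country).
orders = [("alice", 10), ("bob", 20), ("carol", 30), ("alice", 40)]
users = [("alice", "US"), ("bob", "DE"), ("dave", "FR")]

joined = sorted(bucketed_join(cluster(orders), cluster(users)))
print(joined)
```

If only one side were already bucketed, the other side would first have to be passed through the same `bucket` function, which corresponds to the "prepare data for the other side of the join" step described above.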
