[ https://issues.apache.org/jira/browse/HIVE-25975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ádám Szita resolved HIVE-25975.
-------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed
Committed to master. Thanks for the thorough reviews from [~pvary] and
[~Marton Bod].
> Optimize ClusteredWriter for bucketed Iceberg tables
> ----------------------------------------------------
>
> Key: HIVE-25975
> URL: https://issues.apache.org/jira/browse/HIVE-25975
> Project: Hive
> Issue Type: Improvement
> Reporter: Ádám Szita
> Assignee: Ádám Szita
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 6.5h
> Remaining Estimate: 0h
>
> The first version of the ClusteredWriter in Hive-Iceberg will be lenient for
> bucketed tables: the records do not need to be ordered by bucket value.
> Instead, the writer closes its current file and opens a new one whenever it
> receives an out-of-order record.
> This is suboptimal in the long term because it creates many small files.
> Spark uses a UDF to compute the bucket value for each record, so it can
> order the records by bucket value and achieve optimal clustering.
> The proposed change adds a new UDF that uses Iceberg's bucket transformation
> function to produce bucket values from constants or any column input. The
> UDF supports every type that Iceberg's bucket transform supports, except
> UUID.
> This UDF is then used in SortedDynPartitionOptimizer to sort data during
> write if the target Iceberg table has bucket transform partitioning.
> To enable this, Hive has been extended so that storage handlers can define
> custom sorting expressions, which are passed to the FileSink operator's
> DynPartContext during dynamic partitioning write scenarios.
> The lenient version of ClusteredWriter in patched-iceberg-core has been
> removed, as it is no longer needed with this feature in place.
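
The effect described above can be illustrated with a small, self-contained
sketch. It is not the Hive UDF or the Iceberg ClusteredWriter; the names
murmur3_x86_32, bucket and count_files are hypothetical and exist only for
this illustration. It assumes the bucket definition from the Iceberg table
spec, (murmur3_x86_32_hash(x) & Integer.MAX_VALUE) % N, with integer values
hashed as 8-byte little-endian longs, and it models the lenient writer as
"start a new file every time the bucket value changes".

```python
import struct
from typing import Iterable


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 hash of data, returned as an unsigned integer."""
    c1, c2, mask = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF

    def rotl(x: int, r: int) -> int:
        return ((x << r) | (x >> (32 - r))) & mask

    def mix(k: int) -> int:
        k = (k * c1) & mask
        k = rotl(k, 15)
        return (k * c2) & mask

    h = seed & mask
    full = len(data) - len(data) % 4
    # Body: process the input in 4-byte little-endian blocks.
    for i in range(0, full, 4):
        h ^= mix(int.from_bytes(data[i:i + 4], "little"))
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask
    # Tail: the remaining 1-3 bytes, if any.
    tail = data[full:]
    if tail:
        h ^= mix(int.from_bytes(tail, "little"))
    # Finalization (avalanche).
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def bucket(value: int, num_buckets: int) -> int:
    """Bucket an integer: hash its 8-byte little-endian form, clear the
    sign bit, then take the remainder modulo the number of buckets."""
    hashed = murmur3_x86_32(struct.pack("<q", value))
    return (hashed & 0x7FFFFFFF) % num_buckets


def count_files(bucket_values: Iterable[int]) -> int:
    """Simulated lenient writer: a new file starts whenever the bucket
    value differs from that of the previous record."""
    files = 0
    current = None
    for b in bucket_values:
        if b != current:
            files += 1
            current = b
    return files


# Hypothetical input: 1000 record ids written to a table with 4 buckets.
num_buckets = 4
unsorted_buckets = [bucket(i, num_buckets) for i in range(1000)]
sorted_buckets = sorted(unsorted_buckets)

print("files without sorting:", count_files(unsorted_buckets))
print("files with sorting:   ", count_files(sorted_buckets))
```

With sorted input the simulated writer produces exactly one file per distinct
bucket value, whereas unsorted input can produce up to one file per record.
A real implementation should be validated against the hash test vectors
published in the Iceberg spec.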