[
https://issues.apache.org/jira/browse/HIVE-25975?focusedWorklogId=733981&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-733981
]
ASF GitHub Bot logged work on HIVE-25975:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 28/Feb/22 16:48
Start Date: 28/Feb/22 16:48
Worklog Time Spent: 10m
Work Description: szlta opened a new pull request #3060:
URL: https://github.com/apache/hive/pull/3060
This adds a new UDF that uses Iceberg's bucket transformation function to
produce bucket values from constants or any column input. All types that
Iceberg's bucket transform supports are supported by this UDF too, except for UUID.
This UDF is then used in SortedDynPartitionOptimizer to sort data during
write if the target Iceberg table has bucket transform partitioning.
To enable this, Hive has been extended to allow storage handlers to define
custom sorting expressions, which are passed to the FileSink operator's
DynPartContext during dynamic partitioning write scenarios.
The lenient version of ClusteredWriter in patched-iceberg-core has been
removed, as it is no longer needed once this feature is in place.
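To give a feel for what the new UDF computes, here is a minimal sketch of an Iceberg-style bucket transform. Per the Iceberg spec, bucket(N) is `(murmur3_x86_32(serialize(v)) & Integer.MAX_VALUE) % N`; this sketch substitutes an MD5-based hash for murmur3 so it stays self-contained, so the bucket values it produces will not match Iceberg's, only the shape of the computation.

```python
# Hedged sketch of an Iceberg-style bucket transform (NOT the actual
# Hive/Iceberg implementation). Iceberg's spec defines:
#   bucket(v, N) = (murmur3_x86_32(serialize(v)) & Integer.MAX_VALUE) % N
# An MD5-derived 32-bit hash stands in for murmur3 here so the example
# needs only the standard library.
import hashlib

INT_MAX = 2**31 - 1  # Integer.MAX_VALUE; masking keeps the hash non-negative


def bucket(value: str, num_buckets: int) -> int:
    """Map a value deterministically to one of num_buckets partitions."""
    h = int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "little")
    return (h & INT_MAX) % num_buckets


# Constants and column values alike map deterministically to a bucket,
# which is what lets an optimizer sort records by their bucket value:
print(bucket("order-1234", 16))
```

Because the transform is a pure function of the input value, the same expression can be evaluated both at partition time and inside a sort key, which is exactly what SortedDynPartitionOptimizer needs.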
Issue Time Tracking
-------------------
Worklog Id: (was: 733981)
Remaining Estimate: 0h
Time Spent: 10m
> Optimize ClusteredWriter for bucketed Iceberg tables
> ----------------------------------------------------
>
> Key: HIVE-25975
> URL: https://issues.apache.org/jira/browse/HIVE-25975
> Project: Hive
> Issue Type: Improvement
> Reporter: Ádám Szita
> Assignee: Ádám Szita
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The first version of the ClusteredWriter in Hive-Iceberg will be lenient for
> bucketed tables: i.e. the records do not need to be ordered by the bucket
> values; the writer will simply close its current file and open a new one for
> out-of-order records.
> This is suboptimal in the long term because it creates many small files. Spark
> uses a UDF to compute the bucket value for each record and is therefore
> able to order the records by bucket values, achieving optimal clustering.
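The small-files argument above can be sketched with a simplified model of a clustered writer (an assumption for illustration, not the actual ClusteredWriter code): the writer keeps one file open and rolls to a new file whenever the incoming bucket value changes, so every out-of-order run costs an extra file.

```python
# Simplified model (assumption, not the real ClusteredWriter): the writer
# rolls to a new file each time the bucket value of the incoming record
# differs from the previous one.

def files_written(bucket_stream):
    """Count files a clustered writer would open for this record order."""
    files = 0
    current = object()  # sentinel that matches no real bucket value
    for b in bucket_stream:
        if b != current:
            files += 1   # bucket changed: close current file, open a new one
            current = b
    return files


unsorted = [0, 1, 0, 2, 1, 0, 2]      # bucket values interleaved
clustered = sorted(unsorted)          # [0, 0, 0, 1, 1, 2, 2]

print(files_written(unsorted))   # 7 files: every record starts a new file
print(files_written(clustered))  # 3 files: one per bucket
```

Sorting by bucket value collapses the file count to one per bucket, which is the clustering this issue aims for.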