[
https://issues.apache.org/jira/browse/HIVE-26110?focusedWorklogId=752670&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-752670
]
ASF GitHub Bot logged work on HIVE-26110:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 05/Apr/22 07:21
Start Date: 05/Apr/22 07:21
Worklog Time Spent: 10m
Work Description: szlta commented on code in PR #3174:
URL: https://github.com/apache/hive/pull/3174#discussion_r842445223
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java:
##########
@@ -648,7 +648,12 @@ public ReduceSinkOperator getReduceSinkOp(List<Integer> partitionPositions, List
     ArrayList<ExprNodeDesc> partCols = Lists.newArrayList();
     for (Function<List<ExprNodeDesc>, ExprNodeDesc> customSortExpr : customSortExprs) {
-      keyCols.add(customSortExpr.apply(allCols));
+      ExprNodeDesc colExpr = customSortExpr.apply(allCols);
+      // Custom sort expressions are marked as KEYs, which is required for sorting the rows that are going for
+      // a particular reducer instance. They also need to be marked as 'partition' columns for the MapReduce
+      // shuffle phase, in order to gather the same keys to the same reducer instances.
+      keyCols.add(colExpr);
+      partCols.add(colExpr);
Review Comment:
If customSortExprs are present, then we can be sure that partitionPositions are empty, as per
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java#L592-L596
As for dpColNames - I'm not sure why it would matter. With Iceberg tables the table schema already
contains the partition columns too; it's just that Hive doesn't think of these as partition columns,
but rather as regular columns.
I think the schema should be fine: all columns will serve as VALUE (with Iceberg we want to write the
partition values out into the file too, since in some cases the spec can have a non-identity partition
transform), and additionally the ones identified by customSortExprs will be added as KEY, for sorting
purposes only.
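The KEY-vs-partition distinction above can be illustrated with a minimal, self-contained sketch (plain Java, not Hive code; the class and method names are hypothetical). It mimics how a MapReduce shuffle routes a row to a reducer by hashing the partition columns: only columns included in the partition key affect which reducer receives the row, which is why a custom sort expression must be added to partCols as well as keyCols.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical standalone sketch: shuffle routing is hash(partitionKey) % numReducers.
// Columns that are only sort KEYs do not influence this choice; to gather equal
// keys onto one reducer, the sort expression must also be a partition column.
public class ShuffleSketch {
    static int reducerFor(Object partitionKey, int numReducers) {
        // floorMod keeps the reducer index non-negative for negative hash codes
        return Math.floorMod(Objects.hashCode(partitionKey), numReducers);
    }

    public static void main(String[] args) {
        int numReducers = 4;
        // Rows carrying the same (Iceberg) partition value as their sort key
        List<String> rows = List.of("2022-01-01", "2022-01-02", "2022-01-01");
        // When the sort key is also the partition key, rows sharing a value
        // land on the same reducer, which can then write them into one file.
        boolean grouped = reducerFor(rows.get(0), numReducers)
                == reducerFor(rows.get(2), numReducers);
        System.out.println(grouped ? "grouped" : "scattered");
    }
}
```

With the pre-patch code, colExpr went only into keyCols, so the shuffle hash ignored it and equal keys were scattered across reducers, producing many small files per partition.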
Issue Time Tracking
-------------------
Worklog Id: (was: 752670)
Time Spent: 0.5h (was: 20m)
> bulk insert into partitioned table creates lots of files in iceberg
> -------------------------------------------------------------------
>
> Key: HIVE-26110
> URL: https://issues.apache.org/jira/browse/HIVE-26110
> Project: Hive
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> E.g., create the web_returns table from tpcds in Iceberg format and try to copy
> over data from the regular table, along the lines of "insert into
> web_returns_iceberg as select * from web_returns".
> This inserts the data correctly; however, there are a lot of files present in
> each partition. IMO, the dynamic sort optimisation isn't working correctly, which
> causes records not to be grouped in the final phase.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)