ngsg commented on code in PR #4043:
URL: https://github.com/apache/hive/pull/4043#discussion_r1395752162
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/DynamicPartitionPruningOptimization.java:
##########
@@ -678,38 +678,34 @@ private boolean generateSemiJoinOperatorPlan(DynamicListContext ctx, ParseContex
       ArrayList<ColumnInfo> groupbyColInfos = new ArrayList<ColumnInfo>();
       groupbyColInfos.add(new ColumnInfo(gbOutputNames.get(0), key.getTypeInfo(), "", false));
       groupbyColInfos.add(new ColumnInfo(gbOutputNames.get(1), key.getTypeInfo(), "", false));
-      groupbyColInfos.add(new ColumnInfo(gbOutputNames.get(2), key.getTypeInfo(), "", false));
+      groupbyColInfos.add(new ColumnInfo(gbOutputNames.get(2), TypeInfoFactory.binaryTypeInfo, "", false));
       GroupByOperator groupByOp = (GroupByOperator)OperatorFactory.getAndMakeChild(
               groupBy, new RowSchema(groupbyColInfos), selectOp);
       groupByOp.setColumnExprMap(new HashMap<String, ExprNodeDesc>());
       // Get the column names of the aggregations for reduce sink
-      int colPos = 0;
       ArrayList<ExprNodeDesc> rsValueCols = new ArrayList<ExprNodeDesc>();
       Map<String, ExprNodeDesc> columnExprMap = new HashMap<String, ExprNodeDesc>();
-      for (int i = 0; i < aggs.size() - 1; i++) {
-        ExprNodeColumnDesc colExpr = new ExprNodeColumnDesc(key.getTypeInfo(),
-            gbOutputNames.get(colPos), "", false);
+      ArrayList<ColumnInfo> rsColInfos = new ArrayList<>();
+      for (int colPos = 0; colPos < aggs.size(); colPos++) {
+        TypeInfo typInfo = groupbyColInfos.get(colPos).getType();
+        ExprNodeColumnDesc colExpr = new ExprNodeColumnDesc(typInfo,
+            gbOutputNames.get(colPos), "", false);
         rsValueCols.add(colExpr);
-        columnExprMap.put(gbOutputNames.get(colPos), colExpr);
-        colPos++;
-      }
+        columnExprMap.put(Utilities.ReduceField.VALUE + "." + gbOutputNames.get(colPos), colExpr);
-      // Bloom Filter uses binary
-      ExprNodeColumnDesc colExpr = new ExprNodeColumnDesc(TypeInfoFactory.binaryTypeInfo,
-          gbOutputNames.get(colPos), "", false);
-      rsValueCols.add(colExpr);
-      columnExprMap.put(gbOutputNames.get(colPos), colExpr);
-      colPos++;
+        ColumnInfo colInfo =
Review Comment:
@deniskuzZ, I have checked your comment and my work, and I have summarized my conclusions as follows:
1. about `ReduceField.VALUE`
I think we should prefix the names of the RS operator's columns in `colExprMap` and `schema`, because RS's child operators always access their input columns (the output of RS) as `KEY.col` and `VALUE.col`.
An RS operator's output rows are transported to the next operator via shuffle, not by directly calling `Operator.forward()`. `ReduceRecordSource` reads the shuffled KV pairs and calls the child operator's `Operator.process()` on behalf of `Operator.forward()`. If vectorization is disabled, it passes a `List<Object>` of length 2 as a row, and the corresponding ObjectInspector consists of `ReduceField.KEY` and `ReduceField.VALUE`. [1] If vectorization is enabled, it passes a single struct object as a row, and the corresponding ObjectInspector consists of `ReduceField.KEY + "." + fieldName` and `ReduceField.VALUE + "." + fieldName`. [2] In both cases, the column names seen by RS's child operators start with either `ReduceField.KEY` or `ReduceField.VALUE`.
`colExprMap` maps an output column name to its expression [3], so the keys of RS's `colExprMap` should be prefixed with `ReduceField.KEY` or `ReduceField.VALUE`. I could not find any documentation about `schema`, but it seems that `schema` also represents output column names. [4] So I think both the keys of RS's `colExprMap` and its `schema` should be prefixed with `ReduceField.KEY` or `ReduceField.VALUE`.
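For illustration, here is a minimal sketch of that naming convention (the column name `_col2`, the chosen type, and the local variable names are examples I picked for this sketch, not code from the patch):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.ColumnInfo;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.plan.ExprNodeColumnDesc;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;

class RsNamingSketch {
  // Sketch only: suppose the GBY feeding the RS produced a binary column "_col2".
  // Downstream of the RS that column is addressed as "VALUE._col2", so both the
  // colExprMap key and the RowSchema entry should carry the prefixed name.
  static void example() {
    ExprNodeColumnDesc bloomCol =
        new ExprNodeColumnDesc(TypeInfoFactory.binaryTypeInfo, "_col2", "", false);
    String rsOutputName = Utilities.ReduceField.VALUE + "." + "_col2"; // "VALUE._col2"

    Map<String, ExprNodeDesc> columnExprMap = new HashMap<>();
    columnExprMap.put(rsOutputName, bloomCol);   // key matches what RS's children see
    ColumnInfo rsColInfo =
        new ColumnInfo(rsOutputName, TypeInfoFactory.binaryTypeInfo, "", false); // schema entry
  }
}
```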
2. about your investigation
DPPOptimization creates 4 operators, GBY->RS->GBY->RS, and `sharedwork_semi_2.q` exercises PEF by inverting one of the final RS operators that DPPOptimization created. PEF refers to the RS's `colExprMap` when it creates the SEL operator that performs the inversion. That is why the test fails with `java.lang.RuntimeException: cannot find field _col0 from [0:key, 1:value]` if we do not prefix the keys of the final RS's `colExprMap` with `ReduceField.VALUE`. Unlike the final RS operators, the intermediate RS operators are not inverted during the test, so prefixing the intermediate RS operators' column names does not affect the result of the test.
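For context, that error message comes from the serde2 ObjectInspector layer: resolving a bare column name against a row whose only top-level fields are `key` and `value` fails. A minimal reproduction (not the code path PEF actually takes, just where the message originates; the class name and the chosen field inspectors are mine):

```java
import java.util.Arrays;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class FieldLookupRepro {
  public static void main(String[] args) {
    // The row an RS's child receives has two top-level fields: "key" and "value".
    StructObjectInspector rowOI = ObjectInspectorFactory.getStandardStructObjectInspector(
        Arrays.asList("key", "value"),
        Arrays.<ObjectInspector>asList(
            PrimitiveObjectInspectorFactory.writableStringObjectInspector,
            PrimitiveObjectInspectorFactory.writableBinaryObjectInspector));

    // Looking up a bare column name against that row throws:
    //   java.lang.RuntimeException: cannot find field _col0 from [0:key, 1:value]
    rowOI.getStructFieldRef("_col0");
  }
}
```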
3. about `ParallelEdgeFixer.colMappingInverseKeys()`
According to the comment on `Operator.getColumnExprMap()`, it returns only key columns for RS operators. [3] I'm not sure whether that is still accurate, but I want to keep the added code as a kind of defensive programming.
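The defensive idea, sketched roughly (this is not the actual `colMappingInverseKeys()` implementation; the helper name and the filtering policy are only an illustration of what I mean by tolerating a key-only map):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.Operator;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.plan.ExprNodeColumnDesc;
import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;

class InverseMappingSketch {
  // Invert an RS's colExprMap (output name -> source expression) into
  // source column -> output name, keeping only VALUE.* column references.
  // Per the Operator.getColumnExprMap() javadoc, an RS's map may hold only
  // key columns, so anything else is skipped rather than trusted.
  static Map<String, String> inverseValueMapping(Operator<?> rs) {
    Map<String, String> inverse = new HashMap<>();
    Map<String, ExprNodeDesc> colExprMap = rs.getColumnExprMap();
    if (colExprMap == null) {
      return inverse; // defensive: some operators carry no column expression map
    }
    String valuePrefix = Utilities.ReduceField.VALUE + ".";
    for (Map.Entry<String, ExprNodeDesc> e : colExprMap.entrySet()) {
      if (e.getKey().startsWith(valuePrefix) && e.getValue() instanceof ExprNodeColumnDesc) {
        inverse.put(((ExprNodeColumnDesc) e.getValue()).getColumn(), e.getKey());
      }
    }
    return inverse;
  }
}
```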
[1] https://github.com/apache/hive/blob/8a4f5ce7275842ff4f1cc917c7a2a48dde71bf4c/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/ReduceRecordSource.java#L229-L231
[2] https://github.com/apache/hive/blob/8a4f5ce7275842ff4f1cc917c7a2a48dde71bf4c/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L4333-L4341
[3] https://github.com/apache/hive/blob/8a4f5ce7275842ff4f1cc917c7a2a48dde71bf4c/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java#L991-L997
[4] https://github.com/apache/hive/blob/8a4f5ce7275842ff4f1cc917c7a2a48dde71bf4c/ql/src/java/org/apache/hadoop/hive/ql/optimizer/CountDistinctRewriteProc.java#L443-L447