[ https://issues.apache.org/jira/browse/HIVE-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988631#comment-13988631 ]
Sun Rui commented on HIVE-7012:
-------------------------------

I am thinking about the following fix, but I am not sure whether it is right:

sameKeys():
    ExprNodeDesc pexpr = pexprs.get(i);
    ExprNodeDesc cexpr = ExprNodeDescUtils.backtrack(cexprs.get(i), child, parent);
    // check if cexpr is from the parent
    if (cexpr == null || (cexpr not contained in the colExprMap of the parent operator) || !pexpr.isSame(cexpr)) {
      return null;
    }

> Wrong RS de-duplication in the ReduceSinkDeDuplication Optimizer
> ----------------------------------------------------------------
>
>                 Key: HIVE-7012
>                 URL: https://issues.apache.org/jira/browse/HIVE-7012
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.13.0
>            Reporter: Sun Rui
>
> With Hive 0.13.0, run the following test case:
> {code:sql}
> create table src(key bigint, value string);
> select
>   count(distinct key) as col0
> from src
> order by col0;
> {code}
> The following exception will be thrown:
> {noformat}
> java.lang.RuntimeException: Error in configuring object
>     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:485)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>     ... 9 more
> Caused by: java.lang.RuntimeException: Reduce operator initialization failed
>     at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.configure(ExecReducer.java:173)
>     ... 14 more
> Caused by: java.lang.RuntimeException: cannot find field _col0 from [0:reducesinkkey0]
>     at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:415)
>     at org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:150)
>     at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:79)
>     at org.apache.hadoop.hive.ql.exec.GroupByOperator.initializeOp(GroupByOperator.java:288)
>     at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:376)
>     at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.configure(ExecReducer.java:166)
>     ... 14 more
> {noformat}
> This issue is related to HIVE-6455. When hive.optimize.reducededuplication is
> set to false, the issue goes away.
> Logical plan when hive.optimize.reducededuplication=false;
> {noformat}
> src
>   TableScan (TS_0)
>     alias: src
>     Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>     Select Operator (SEL_1)
>       expressions: key (type: bigint)
>       outputColumnNames: key
>       Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>       Group By Operator (GBY_2)
>         aggregations: count(DISTINCT key)
>         keys: key (type: bigint)
>         mode: hash
>         outputColumnNames: _col0, _col1
>         Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>         Reduce Output Operator (RS_3)
>           istinctColumnIndices:
>           key expressions: _col0 (type: bigint)
>           DistributionKeys: 0
>           sort order: +
>           OutputKeyColumnNames: _col0
>           Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>           Group By Operator (GBY_4)
>             aggregations: count(DISTINCT KEY._col0:0._col0)
>             mode: mergepartial
>             outputColumnNames: _col0
>             Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>             Select Operator (SEL_5)
>               expressions: _col0 (type: bigint)
>               outputColumnNames: _col0
>               Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>               Reduce Output Operator (RS_6)
>                 key expressions: _col0 (type: bigint)
>                 DistributionKeys: 1
>                 sort order: +
>                 OutputKeyColumnNames: reducesinkkey0
>                 OutputVAlueColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>                 value expressions: _col0 (type: bigint)
>                 Extract (EX_7)
>                   Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>                   File Output Operator (FS_8)
>                     compressed: false
>                     Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>                     table:
>                         input format: org.apache.hadoop.mapred.TextInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> {noformat}
> You will see that RS_3 and RS_6 are not merged.
> Logical plan when hive.optimize.reducededuplication=true;
> {noformat}
> src
>   TableScan (TS_0)
>     alias: src
>     Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>     Select Operator (SEL_1)
>       expressions: key (type: bigint)
>       outputColumnNames: key
>       Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>       Group By Operator (GBY_2)
>         aggregations: count(DISTINCT key)
>         keys: key (type: bigint)
>         mode: hash
>         outputColumnNames: _col0, _col1
>         Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>         Reduce Output Operator (RS_3)
>           istinctColumnIndices:
>           key expressions: _col0 (type: bigint)
>           DistributionKeys: 1
>           sort order: +
>           OutputKeyColumnNames: reducesinkkey0
>           Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
>           Group By Operator (GBY_4)
>             aggregations: count(DISTINCT KEY._col0:0._col0)
>             mode: mergepartial
>             outputColumnNames: _col0
>             Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>             Select Operator (SEL_5)
>               expressions: _col0 (type: bigint)
>               outputColumnNames: _col0
>               Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>               File Output Operator (FS_8)
>                 compressed: false
>                 Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> {noformat}
> You will see that RS_6 has been merged into RS_3. However, the merge is
> obviously incorrect because RS_3 and RS_6 have different sort keys. (The sort
> key for RS_3 is key and the sort key for RS_6 is count(distinct key).)
> The problem is that the method sameKeys() reports that both RS operators have
> the same keys. sameKeys() depends on ExprNodeDescUtils.backtrack() to backtrack
> a key expr of the child RS (cRS) to the parent RS (pRS).
> I don't understand the rationale behind the following logic in
> ExprNodeDescUtils:
> Why still backtrack when there is no mapping for the column of the current
> operator?
> {code}
> private static ExprNodeDesc backtrack(ExprNodeColumnDesc column,
>     Operator<?> current, Operator<?> terminal) throws SemanticException {
>   ...
>   if (mapping == null || !mapping.containsKey(column.getColumn())) {
>     return backtrack((ExprNodeDesc)column, current, terminal);
>   }
>   ...
> }
> {code}
> The process of backtracking _col0 of cRS to pRS:
> RS_6:_col0 --> SEL_5:_col0 --> GBY_4:_col0 (because the colExprMap is null
> for GBY_4) --> RS_3:_col0 (no mapping for output column _col0), which is a
> wrong backtrack.

--
This message was sent by Atlassian JIRA (v6.2#6252)
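
The backtracking walk described in the issue can be reproduced with a toy model. The sketch below does not use Hive classes: `Op`, `reportChain()` and `backtrack()` are hypothetical stand-ins, column expressions are reduced to plain strings, and the colExprMap contents for SEL_5 and RS_3 are assumed for illustration (only "null for GBY_4" and "no entry for _col0 in RS_3" come from the report). The `strict` flag is one possible stricter behaviour, giving up as soon as a column has no mapping; it is not necessarily the fix proposed in the comment or the one adopted in Hive.

```java
import java.util.Collections;
import java.util.Map;

public class BacktrackSketch {

    /** Toy operator: a name, an optional output-to-input column map, one parent. */
    static final class Op {
        final String name;
        final Map<String, String> colExprMap; // null models an operator without a map
        final Op parent;

        Op(String name, Map<String, String> colExprMap, Op parent) {
            this.name = name;
            this.colExprMap = colExprMap;
            this.parent = parent;
        }
    }

    /** Builds RS_3 <- GBY_4 <- SEL_5 <- RS_6 and returns {RS_3, RS_6}. */
    static Op[] reportChain() {
        // Assumed map: RS_3 exposes its key as KEY._col0, so it has no entry for _col0.
        Op rs3 = new Op("RS_3", Collections.singletonMap("KEY._col0", "_col0"), null);
        Op gby4 = new Op("GBY_4", null, rs3); // colExprMap is null, as in the report
        Op sel5 = new Op("SEL_5", Collections.singletonMap("_col0", "_col0"), gby4);
        Op rs6 = new Op("RS_6", null, sel5);
        return new Op[] {rs3, rs6};
    }

    /**
     * Walks from child's parent up to terminal (inclusive), translating the
     * column through each colExprMap. When an operator has no mapping for the
     * column, the lenient mode keeps the name unchanged and continues, while
     * the strict mode returns null.
     */
    static String backtrack(String column, Op child, Op terminal, boolean strict) {
        String col = column;
        for (Op op = child.parent; op != null; op = op.parent) {
            Map<String, String> mapping = op.colExprMap;
            if (mapping != null && mapping.containsKey(col)) {
                col = mapping.get(col);
            } else if (strict) {
                return null; // column cannot be traced through this operator
            }
            if (op == terminal) {
                return col;
            }
        }
        return null; // terminal is not an ancestor of child
    }

    public static void main(String[] args) {
        Op[] ends = reportChain();
        String parentKey = "_col0"; // key expression of RS_3 in the plan

        String lenient = backtrack("_col0", ends[1], ends[0], false);
        String strict = backtrack("_col0", ends[1], ends[0], true);

        // Lenient walk yields "_col0", which equals RS_3's key by name only.
        System.out.println("lenient: " + lenient + ", same keys: " + parentKey.equals(lenient));
        // Strict walk yields null, so the two RS would not be treated as having the same keys.
        System.out.println("strict:  " + strict + ", same keys: " + parentKey.equals(strict));
    }
}
```

In this model the lenient walk carries the name _col0 unchanged through GBY_4 and RS_3, so the comparison with RS_3's key succeeds even though the two columns are unrelated, which mirrors the wrong merge in the second plan.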