Re: [PR] HIVE-28254: CBO (Calcite Return Path): Multiple DISTINCT leads to wrong results [hive]

via GitHub Thu, 16 May 2024 06:40:53 -0700


okumin commented on code in PR #5245:
URL: https://github.com/apache/hive/pull/5245#discussion_r1603371712



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/opconventer/HiveGBOpConvUtil.java:
##########
@@ -1028,32 +1028,65 @@ private static OpAttr genReduceSideGB1NoMapGB(OpAttr 
inputOpAf, GBInfo gbInfo,
     List<ExprNodeDesc> reduceValues = rs.getConf().getValueCols();
     ArrayList<AggregationDesc> aggregations = new ArrayList<AggregationDesc>();
     int udafColStartPosInOriginalGB = gbInfo.gbKeys.size();
-    // the positions in rsColInfoLst are as follows
-    // --grpkey--,--distkey--,--values--
-    // but distUDAF may be before/after some non-distUDAF,
-    // i.e., their positions can be mixed.
-    // so for all UDAF we first check to see if it is groupby key, if not is 
it distinct key
-    // if not it should be value
-    Map<Integer, List<ExprNodeDesc>> indexToParameter = new TreeMap<>();
+
+    final List<List<ColumnInfo>> paramColInfoTable = new 
ArrayList<>(gbInfo.udafAttrs.size());
+    final List<List<String>> distinctColumnNameTable = new 
ArrayList<>(gbInfo.udafAttrs.size());
+    final Map<ColumnInfo, String> distinctColumnMapping = new HashMap<>();
     for (int i = 0; i < gbInfo.udafAttrs.size(); i++) {
-      UDAFAttrs udafAttr = gbInfo.udafAttrs.get(i);
-      ArrayList<ExprNodeDesc> aggParameters = new ArrayList<ExprNodeDesc>();
+      final UDAFAttrs udafAttr = gbInfo.udafAttrs.get(i);
+      final List<ColumnInfo> paramColInfo = new 
ArrayList<>(udafAttr.udafParams.size());
+      final List<String> distinctColNames = new 
ArrayList<>(udafAttr.udafParams.size());
 
-      ColumnInfo rsUDAFParamColInfo;
-      ExprNodeDesc udafParam;
-      ExprNodeDesc constantPropDistinctUDAFParam;
       for (int j = 0; j < udafAttr.udafParams.size(); j++) {
-        int argPos = getColInfoPos(udafAttr.udafParams.get(j), gbInfo);
-        rsUDAFParamColInfo = rsColInfoLst.get(argPos);
-        String rsUDAFParamName = rsUDAFParamColInfo.getInternalName();
+        final int argPos = getColInfoPos(udafAttr.udafParams.get(j), gbInfo);
+        final ColumnInfo rsUDAFParamColInfo = rsColInfoLst.get(argPos);
+        paramColInfo.add(rsUDAFParamColInfo);
 
+        final String distinctColumnName;
         if (udafAttr.isDistinctUDAF && lastReduceKeyColName != null) {
-          rsUDAFParamName = Utilities.ReduceField.KEY.name() + "." + 
lastReduceKeyColName
+          distinctColumnName = Utilities.ReduceField.KEY.name() + "." + 
lastReduceKeyColName
                   + ":" + numDistinctUDFs + "." + 
SemanticAnalyzer.getColumnInternalName(j);
+          distinctColumnMapping.putIfAbsent(rsUDAFParamColInfo, 
distinctColumnName);
+        } else {
+          distinctColumnName = null;
+        }
+        distinctColNames.add(distinctColumnName);
+      }
+
+      paramColInfoTable.add(paramColInfo);
+      distinctColumnNameTable.add(distinctColNames);
+
+      if(udafAttr.isDistinctUDAF) {
+        numDistinctUDFs++;
+      }
+    }
+
+    // the positions in rsColInfoLst are as follows
+    // --grpkey--,--distkey--,--values--
+    // but distUDAF may be before/after some non-distUDAF,
+    // i.e., their positions can be mixed.
+    // so for all UDAF we first check to see if it is groupby key, if not is 
it distinct key
+    // if not it should be value
+    final Map<Integer, List<ExprNodeDesc>> indexToParameter = new TreeMap<>();
+    for (int i = 0; i < paramColInfoTable.size(); i++) {
+      final ArrayList<ExprNodeDesc> aggParameters = new ArrayList<>();
+
+      for (int j = 0; j < paramColInfoTable.get(i).size(); j++) {
+        final ColumnInfo rsUDAFParamColInfo = paramColInfoTable.get(i).get(j);
+
+        final String rsUDAFParamName;
+        if (distinctColumnNameTable.get(i).get(j) != null) {
+          rsUDAFParamName = distinctColumnNameTable.get(i).get(j);
+        } else if (distinctColumnMapping.get(rsUDAFParamColInfo) != null) {
+          // This UDAF is not labeled with DISTINCT, but it refers to a 
DISTINCT key.
+          // The original internal name is already obsolete as any DISTINCT 
keys are renamed.
+          rsUDAFParamName = distinctColumnMapping.get(rsUDAFParamColInfo);
+        } else {
+          rsUDAFParamName = rsUDAFParamColInfo.getInternalName();

Review Comment:
   Why not. Let me dive into the current behavior. I'm also not so familiar 
with the relevant code.
   
   ## Sample query
   
   I will use the following query to illustrate a specific case.
   
   ```
   CREATE TABLE test (agg_key STRING, udaf_key_1 INT, udaf_key_2 INT);
   
   set hive.cbo.returnpath.hiveop=true;
   set hive.map.aggr=false;
   
   SELECT
     agg_key,
     SUM(DISTINCT udaf_key_1),
     COUNT(DISTINCT udaf_key_1),
     SUM(DISTINCT udaf_key_2),
     SUM(udaf_key_2)
   FROM test
   GROUP BY agg_key;
   ```
   
   ## How to encode keys on ReduceSinkOperator
   
   [ReduceSinkOperator stores (sort) keys and one union of all DISTINCT keys in 
the list of 
keys](https://github.com/apache/hive/blob/rel/release-4.0.0/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L257-L289).
   
   For the sample query, the actual ReduceSinkDesk is an object like below.
   
   ```
   ReduceSinkDesk {
     keyCols = [agg_key, udaf_key_1, udaf_key_2]
     outputKeyColumnNames = [_col0, _col1]
     distinctColumnIndices = [1, 1, 2]
     numDistributionKeys = 1
   }
   ```
   
   Then, the following struct will be generated as a list of keys.
   
   ```
   {
     _col0 = Text for agg_key
     _col1 = UnionObject for (udaf_key_1, udaf_key_1, udaf_key_2)
   }
   ```
   
   And its RowSchema of RS is `[KEY._col0, KEY._col1:0._col0, 
KEY._col1:1._col0]`. To be honest, I am not sure how this is configured, but it 
makes sense to have deduplicated expressions.
   
   `KEY.%s:%s.%s` is an expression for a UNION column.
   
   ## How Return Path configures UDAF keys
   
   As you see, HiveGBOpConvUtil names UDAFs keys as below.
   
   - If it is a distinct key, new names are assigned, `KEY.{the final output 
col name}:{position of UDAF}.{position of arguments of the UDAF}`
       - `KEY._col1:0.col0` for the arg of `SUM(DISTINCT udaf_key_1)`
       - `KEY._col1:1.col0` for the arg of `COUNT(DISTINCT udaf_key_1)`
       - `KEY._col1:2.col0` for the arg of `SUM(DISTINCT udaf_key_2)`
   - If it is not a distinct key, the internal name of ReduceSink is used as it 
is
       - `KEY._col1:1._col0` for the arg of `SUM(udaf_key_2)`, derived from 
naming of RSO, `[KEY._col0, KEY._col1:0._col0, KEY._col1:1._col0]`
           - This conflicts with the arg of `COUNT(DISTINCT udaf_key_1)`
   
   ## Problem
   
   [The configured param names must be ones for the world of 
GroupByOperator](https://github.com/apache/hive/blob/rel/release-4.0.0/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java#L281-L300).
 But the current implementation uses a name in the world of ReduceSinkOperator. 
In the worst case, the name in ReduceSinkOperator can conflict with the name 
for a newly created GroupByOperator, and can access a wrong column.
   
   ## Questions
   
   - Why does it happen only with `hive.map.aggr=false`?
       - With `hive.map.aggr=true`, it is basically impossible to reuse an 
expression of a DISTINCT key that could be deduplicated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-28254: CBO (Calcite Return Path): Multiple DISTINCT leads to wrong results [hive]

Reply via email to