[GitHub] [hive] jcamachor commented on a change in pull request #970: Hive 23100

GitBox Thu, 16 Apr 2020 21:09:13 -0700

jcamachor commented on a change in pull request #970: Hive 23100
URL: https://github.com/apache/hive/pull/970#discussion_r409981518


 ##########
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
 ##########
 @@ -972,55 +1005,42 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode expr,
           }
         }
         if (genericUDF instanceof GenericUDFIn) {
-
-          T columnDesc = children.get(0);
-          List<T> outputOpList = children.subList(1, children.size());
-          List<T> inOperands = new ArrayList<>(outputOpList);
-          outputOpList.clear();
-
-          boolean hasNullValue = false;
-          for (T oldChild : inOperands) {
-            if (oldChild == null) {
-              hasNullValue = true;
-              continue;
-            }
-            T newChild = interpretNodeAsStruct(columnDesc, oldChild);
-            if (newChild == null) {
-              hasNullValue = true;
-              continue;
-            }
-            outputOpList.add(newChild);
-          }
-
-          if (hasNullValue) {
-            T nullConst = 
exprFactory.createConstantExpr(exprFactory.getTypeInfo(columnDesc), null);
-            if (outputOpList.size() == 0) {
-              // we have found only null values...remove the IN ; it will be 
null all the time.
-              return nullConst;
+          ListMultimap<TypeInfo, T> expressions = ArrayListMultimap.create();
+          for (int i = 1; i < children.size(); i++) {
+            T columnDesc = children.get(0);
+            T valueDesc = interpretNodeAsConstantStruct(columnDesc, 
children.get(i));
+            if (valueDesc == null) {
+              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              if (!expressions.containsKey(targetType)) {
+                expressions.put(targetType, columnDesc);
+              }
+              T nullConst = exprFactory.createConstantExpr(targetType, null);
+              expressions.put(targetType, nullConst);
+            } else {
+              TypeInfo targetType = exprFactory.getTypeInfo(valueDesc);
+              if (!expressions.containsKey(targetType)) {
+                expressions.put(targetType, columnDesc);
+              }
+              expressions.put(targetType, valueDesc);
             }
-            outputOpList.add(nullConst);
           }
 
-          if (!ctx.isCBOExecuted()) {
-
-            HiveConf conf;
-            try {
-              conf = Hive.get().getConf();
-            } catch (HiveException e) {
-              throw new SemanticException(e);
-            }
-            if (children.size() <= HiveConf.getIntVar(conf, 
HiveConf.ConfVars.HIVEOPT_TRANSFORM_IN_MAXNODES)) {
-              List<T> orOperands = 
exprFactory.rewriteINIntoORFuncCallExpr(children);
-              if (orOperands != null) {
-                if (orOperands.size() == 1) {
-                  
orOperands.add(exprFactory.createBooleanConstantExpr(Boolean.FALSE.toString()));
-                }
-                funcText = "or";
-                genericUDF = new GenericUDFOPOr();
-                children.clear();
-                children.addAll(orOperands);
-              }
+          children.clear();
+          List<T> newExprs = new ArrayList<>();
+          int numEntries = expressions.keySet().size();
+          if (numEntries == 1) {
+            children.addAll(expressions.asMap().values().iterator().next());
+            funcText = "in";
 
 Review comment:
   I am going to add more information here because I think all this part was 
misleading, but here is the basic idea. We split the IN clause into different 
IN clauses, one for each different value type. As you know, Hive and Calcite 
treat types in IN clauses differently and it was practically impossible to find 
some correct implementation with a common part unless this was done. Then I 
rely on Calcite to merge the IN clauses if they are type compatible. This seems 
to be working fine.
   I think this also answers your questions below.
   1) We may have multiple values per type (observe it is a multimap). This is 
precisely what I want, to group those values by type.
   2) In previous version, when IN clause had multiple null values, a single 
null value was added. Now I add multiple null values (basically I omitted the 
deduplication). I have added the deduplication back, although this can be done 
by Calcite.
   3) For the twisted part, the initialization needs be done within the 
children processing because it depends on the type of the value itself, unless 
it is null in which case it depends on the column type.
   
   Could this logic be improved or rewritten? Definitely... But I can tell you 
I work hard on this and the problem is that there are two many corner cases and 
two different type systems, and it is very easy that even the most minimal 
change produces incorrect results in any of them. That is why I came up with 
this solution to split IN clause depending on value type.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [hive] jcamachor commented on a change in pull request #970: Hive 23100

Reply via email to