zabetak commented on code in PR #6503:
URL: https://github.com/apache/hive/pull/6503#discussion_r3372441852


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/SearchTransformer.java:
##########
@@ -72,30 +72,44 @@ public SearchTransformer(RexBuilder rexBuilder, RexCall 
search, final RexUnknown
     this.unknownContext = unknownContext;
   }
 
+  /**
+   * Transforms the SEARCH expression into an equivalent RexNode expression.
+   * Warning: when called from a shuttle, callers of this method should 
consider flattening AND/OR expressions
+   * afterward, to get the same result as applying {@link 
SearchTransformer.Shuttle}.
+   */
   public RexNode transform() {
     PerfLogger perfLogger = SessionState.getPerfLogger();
     perfLogger.perfLogBegin(this.getClass().getName(), 
PerfLogger.SEARCH_TRANSFORMER);
 
-    RangeConverter<C> consumer = new RangeConverter<>(rexBuilder, operandType, 
ref);
-    RangeSets.forEach(sarg.rangeSet, consumer);
-
     List<RexNode> orList = new ArrayList<>();
     if (sarg.nullAs == RexUnknownAs.TRUE && unknownContext != 
RexUnknownAs.TRUE) {
       orList.add(rexBuilder.makeCall(SqlStdOperatorTable.IS_NULL, ref));
     }
-    switch (consumer.inLiterals.size()) {
-    case 0:
-      break;
-    case 1:
-      orList.add(rexBuilder.makeCall(SqlStdOperatorTable.EQUALS, ref, 
consumer.inLiterals.get(0)));
-      break;
-    default:
-      List<RexNode> operands = new ArrayList<>(consumer.inLiterals.size() + 1);
-      operands.add(ref);
-      operands.addAll(consumer.inLiterals);
-      orList.add(rexBuilder.makeCall(HiveIn.INSTANCE, operands));
+
+    if (sarg.isComplementedPoints()) {

Review Comment:
   Yeap, that's the idea; check both and pick the simpler. In fact, the 
`sarg.isComplementedPoints` is already doing this in a systematic fashion but I 
was suggesting to generalize it a bit more.
   
   I am also in favor of doing this in a follow-up (if necessary) and not as 
part of this PR, which is almost ready to merge.
   



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -628,14 +628,23 @@ private RexNode makeLiteral(C value) {
     private double compute() {
       final List<RexNode> inLiterals = new ArrayList<>();
       final List<Double> rangeSelectivities = new ArrayList<>();
-      for (Range<C> range : sarg.rangeSet.asRanges()) {
-        if (!range.hasLowerBound() && !range.hasUpperBound()) {
-          return 1.0; // "all" range
+      final List<Double> searchSelectivities = new ArrayList<>();
+
+      if (sarg.isComplementedPoints()) {
+        // Generate 'ref <> value1 AND ... AND ref <> valueN'
+        List<RexNode> notEq = sarg.rangeSet.complement().asRanges().stream()
+            .map(range -> rexBuilder.makeCall(SqlStdOperatorTable.NOT_EQUALS, 
ref, makeLiteral(range.lowerEndpoint())))
+            .toList();
+        searchSelectivities.add(RexUtil.composeConjunction(rexBuilder, 
notEq).accept(FilterSelectivityEstimator.this));
+      } else {

Review Comment:
   > However, it might be possible that histograms are not available in 
general...
   
   When histograms are not available then I guess all the following ranges are 
gonna fall to the hard-coded default selectivity:
   ```
   SEARCH($0, Sarg[(50..100)])
   SEARCH($0, Sarg[(50..100]])
   SEARCH($0, Sarg[[50..100)])
   SEARCH($0, Sarg[[50..100]])
   SEARCH($0, Sarg[[50..+∞)])
   SEARCH($0, Sarg[(-∞..100]])
   ...
   ```
   However, if we have more stats available (i.e., NDV, min, max, null counts) 
for column `$0` then possibly we could return something more accurate than 
`1/3` (or `selDisjuction(1/3, 1/3)`) even though histograms are missing. If we 
implement this improvement then also `SEARCH` expressions with complemented 
points (such as the one below) could benefit for better selectivities.
   ```
   SEARCH($0, Sarg[(-∞..10), (10..20), (20..30), (30, +∞)])
   ```
   Estimates are not the main focus of this PR so I would be fine to keep the 
code simpler at this stage and log another JIRA ticket for targeting the bigger 
problem "Improve range selectivity estimations in the absence of histograms". 
   
   Another caveat is that if we have a large complement then we are going to 
perform a potentially expensive check + transform + tree traversal just for the 
purpose of selectivity that may be quite off and may not worth it overall.
   
   For the above, I am also leaning towards reverting the change for the time 
being.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to