[ https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830440#comment-16830440 ]

ASF GitHub Bot commented on DRILL-7187:
---------------------------------------

amansinha100 commented on pull request #1772: DRILL-7187: Improve selectivity estimation of BETWEEN predicates and …
URL: https://github.com/apache/drill/pull/1772#discussion_r279831169
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/NumericEquiDepthHistogram.java
 ##########
 @@ -76,129 +73,188 @@ public NumericEquiDepthHistogram(int numBuckets) {
     numRowsPerBucket = -1;
   }
 
-  public long getNumRowsPerBucket() {
+  public double getNumRowsPerBucket() {
     return numRowsPerBucket;
   }
 
-  public void setNumRowsPerBucket(long numRows) {
+  public void setNumRowsPerBucket(double numRows) {
     this.numRowsPerBucket = numRows;
   }
 
   public Double[] getBuckets() {
     return buckets;
   }
 
+  /**
+   * Get the number of buckets in the histogram.
+   * The number of buckets is 1 less than the total # entries in the buckets array, since the last
+   * entry is the end point of the last bucket.
+   */
+  public int getNumBuckets() {
+    return buckets.length - 1;
+  }
+
+  /**
+   * Estimate the selectivity of a filter which may contain several range predicates and in the general case is of
+   * type: col op value1 AND col op value2 AND col op value3 ...
+   * <p>
+   *   e.g. a > 10 AND a < 50 AND a >= 20 AND a <= 70 ...
+   * </p>
+   * Even though in most cases it will have either 1 or 2 range conditions, we still have to handle the general case.
+   * For each conjunct, we will find the histogram bucket ranges and intersect them, taking into account that the
+   * first and last bucket may be partially covered and all other buckets in the middle are fully covered.
+   */
   @Override
-  public Double estimatedSelectivity(final RexNode filter) {
-    if (numRowsPerBucket >= 0) {
-      // at a minimum, the histogram should have a start and end point of 1 bucket, so at least 2 entries
-      Preconditions.checkArgument(buckets.length >= 2, "Histogram has invalid number of entries");
-      final int first = 0;
-      final int last = buckets.length - 1;
-
-      // number of buckets is 1 less than the total # entries in the buckets array since last
-      // entry is the end point of the last bucket
-      final int numBuckets = buckets.length - 1;
-      final long totalRows = numBuckets * numRowsPerBucket;
+  public Double estimatedSelectivity(final RexNode columnFilter, final long totalRowCount) {
+    if (numRowsPerBucket == 0) {
+      return null;
+    }
+
+    // at a minimum, the histogram should have a start and end point of 1 bucket, so at least 2 entries
+    Preconditions.checkArgument(buckets.length >= 2, "Histogram has invalid number of entries");
+
+    List<RexNode> filterList = RelOptUtil.conjunctions(columnFilter);
+
+    Range<Double> fullRange = Range.all();
+    List<RexNode> unknownFilterList = new ArrayList<RexNode>();
+
+    Range<Double> valuesRange = getValuesRange(filterList, fullRange, unknownFilterList);
+
+    long numSelectedRows;
+    // unknown counter is a count of filter predicates whose bucket ranges cannot be
+    // determined from the histogram; this may happen for instance when there is an expression or
+    // function involved, e.g. col > CAST('10' as INT)
+    int unknown = unknownFilterList.size();
+
+    if (valuesRange.hasLowerBound() || valuesRange.hasUpperBound()) {
+      numSelectedRows = getSelectedRows(valuesRange);
+    } else {
+      numSelectedRows = 0;
+    }
+
+    if (numSelectedRows <= 0) {
+      return SMALL_SELECTIVITY;
+    } else {
+      // for each 'unknown' range filter selectivity, use a default of 0.5 (matches Calcite)
+      double scaleFactor = Math.pow(0.5, unknown);
+      return ((double) numSelectedRows / totalRowCount) * scaleFactor;
+    }
+  }
+
+  private Range<Double> getValuesRange(List<RexNode> filterList, Range<Double> fullRange, List<RexNode> unknownFilterList) {
+    Range<Double> currentRange = fullRange;
+    for (RexNode filter : filterList) {
       if (filter instanceof RexCall) {
-        // get the operator
-        SqlOperator op = ((RexCall) filter).getOperator();
-        if (op.getKind() == SqlKind.GREATER_THAN ||
-                op.getKind() == SqlKind.GREATER_THAN_OR_EQUAL) {
-          Double value = getLiteralValue(filter);
-          if (value != null) {
-
-            // *** Handle the boundary conditions first ***
-
-            // if value is less than or equal to the first bucket's start point then all rows qualify
-            int result = value.compareTo(buckets[first]);
-            if (result <= 0) {
-              return LARGE_SELECTIVITY;
-            }
-            // if value is greater than the end point of the last bucket, then none of the rows qualify
-            result = value.compareTo(buckets[last]);
-            if (result > 0) {
-              return SMALL_SELECTIVITY;
-            } else if (result == 0) {
-              if (op.getKind() == SqlKind.GREATER_THAN_OR_EQUAL) {
-                // value is exactly equal to the last bucket's end point so we take the ratio 1/bucket_width
-                long totalFilterRows = (long) (1 / (buckets[last] - buckets[last - 1]) * numRowsPerBucket);
-                double selectivity = (double) totalFilterRows / totalRows;
-                return selectivity;
-              } else {
-                // predicate is 'column > value' and value is equal to last bucket's endpoint, so none of
-                // the rows qualify
-                return SMALL_SELECTIVITY;
-              }
-            }
-
-            // *** End of boundary conditions ****
-
-            int n = getContainingBucket(value, numBuckets);
-            if (n >= 0) {
-              // all buckets to the right of containing bucket will be fully covered
-              int coveredBuckets = (last) - (n + 1);
-              long coveredRows = numRowsPerBucket * coveredBuckets;
-              // num matching rows in the current bucket is a function of (end_point_of_bucket - value)
-              long partialRows = (long) ((buckets[n + 1] - value) / (buckets[n + 1] - buckets[n]) * numRowsPerBucket);
-              long totalFilterRows = partialRows + coveredRows;
-              double selectivity = (double) totalFilterRows / totalRows;
-              return selectivity;
-            } else {
-              // value does not exist in any of the buckets
-              return SMALL_SELECTIVITY;
-            }
-          }
-        } else if (op.getKind() == SqlKind.LESS_THAN ||
-                op.getKind() == SqlKind.LESS_THAN_OR_EQUAL) {
-          Double value = getLiteralValue(filter);
-          if (value != null) {
-
-            // *** Handle the boundary conditions first ***
-
-            // if value is greater than the last bucket's end point then all rows qualify
-            int result = value.compareTo(buckets[last]);
-            if (result >= 0) {
-              return LARGE_SELECTIVITY;
-            }
-            // if value is less than the first bucket's start point then none of the rows qualify
-            result = value.compareTo(buckets[first]);
-            if (result < 0) {
-              return SMALL_SELECTIVITY;
-            } else if (result == 0) {
-              if (op.getKind() == SqlKind.LESS_THAN_OR_EQUAL) {
-                // value is exactly equal to the first bucket's start point so we take the ratio 1/bucket_width
-                long totalFilterRows = (long) (1 / (buckets[first + 1] - buckets[first]) * numRowsPerBucket);
-                double selectivity = (double) totalFilterRows / totalRows;
-                return selectivity;
-              } else {
-                // predicate is 'column < value' and value is equal to first bucket's start point, so none of
-                // the rows qualify
-                return SMALL_SELECTIVITY;
-              }
-            }
-
-            // *** End of boundary conditions ****
-
-            int n = getContainingBucket(value, numBuckets);
-            if (n >= 0) {
-              // all buckets to the left will be fully covered
-              int coveredBuckets = n;
-              long coveredRows = numRowsPerBucket * coveredBuckets;
-              // num matching rows in the current bucket is a function of (value - start_point_of_bucket)
-              long partialRows = (long) ((value - buckets[n]) / (buckets[n + 1] - buckets[n]) * numRowsPerBucket);
-              long totalFilterRows = partialRows + coveredRows;
-              double selectivity = (double) totalFilterRows / totalRows;
-              return selectivity;
-            } else {
-              // value does not exist in any of the buckets
-              return SMALL_SELECTIVITY;
-            }
+        Double value = getLiteralValue(filter);
+        if (value == null) {
+          unknownFilterList.add(filter);
+        } else {
+          // get the operator
+          SqlOperator op = ((RexCall) filter).getOperator();
+          Range range;
+          switch (op.getKind()) {
+            case GREATER_THAN:
+              range = Range.greaterThan(value);
+              currentRange = currentRange.intersection(range);
+              break;
+            case GREATER_THAN_OR_EQUAL:
+              range = Range.atLeast(value);
+              currentRange = currentRange.intersection(range);
+              break;
+            case LESS_THAN:
+              range = Range.lessThan(value);
+              currentRange = currentRange.intersection(range);
+              break;
+            case LESS_THAN_OR_EQUAL:
+              range = Range.atMost(value);
+              currentRange = currentRange.intersection(range);
+              break;
+            default:
+              break;
           }
         }
       }
     }
-    return null;
+    return currentRange;
+  }
+
+  private long getSelectedRows(final Range range) {
+    final int numBuckets = buckets.length - 1;
+    double startBucketFraction = 1.0;
+    double endBucketFraction = 1.0;
+    long numRows = 0;
+    int result;
+    Double lowValue = null;
+    Double highValue = null;
+    final int first = 0;
+    final int last = buckets.length - 1;
+    int startBucket = first;
+    int endBucket = last;
+
+    if (range.hasLowerBound()) {
 
 Review comment:
   There are some key differences in the comparisons and the bucket ranges that make it difficult to consolidate, and even if I were to do it, it would look somewhat contrived. Let me know if you have something specific in mind. For now I have left it as is.
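
   For readers following this thread: below is a minimal sketch of what such a consolidation might look like. The names `buckets` and `numRowsPerBucket` mirror the patch, but the helper itself is hypothetical; it exists only to illustrate the equi-depth interpolation that both branches perform for a partially covered bucket.

   ```java
   // Hypothetical helper, not part of the patch. Under the equi-depth assumption,
   // rows are uniformly distributed within a bucket, so the rows contributed by
   // the bucket containing 'value' are proportional to the covered width.
   private double partialBucketRows(final int n, final double value, final boolean isLowerBound) {
     final double bucketWidth = buckets[n + 1] - buckets[n];
     final double coveredFraction = isLowerBound
         ? (buckets[n + 1] - value) / bucketWidth   // rows at or above 'value'
         : (value - buckets[n]) / bucketWidth;      // rows at or below 'value'
     return coveredFraction * numRowsPerBucket;
   }
   ```

   Whether routing both bounds through a flag-driven helper like this reads better than the two explicit blocks is exactly the judgment call discussed above.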
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Improve selectivity estimates for range predicates when using histogram
> -----------------------------------------------------------------------
>
>                 Key: DRILL-7187
>                 URL: https://issues.apache.org/jira/browse/DRILL-7187
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>             Fix For: 1.17.0
>
>
> 2 types of selectivity estimation improvements need to be done:
> 1.  For range predicates on the same column, we need to collect all such 
> predicates in 1 group and do a histogram lookup for them together. 
> For instance: 
> {noformat}
>  WHERE a > 10 AND b < 20 AND c = 100 AND a <= 50 AND b < 50
> {noformat}
> Currently, the Drill behavior is to treat each of the conjuncts 
> independently and multiply the individual selectivities. However, that will 
> not give accurate estimates. Here, we want to group the predicates on 'a' 
> together and do a single lookup. Similarly for 'b'.
> 2. NULLs are not maintained by the histogram, but when doing the selectivity 
> calculations, the histogram should use the totalRowCount as the denominator 
> rather than the non-null count.
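
To make item (1) concrete, here is a minimal sketch of the grouping idea. It assumes Calcite's RelOptUtil/RexUtil APIs; the helper itself, the per-column histograms map, and the rexBuilder/filter/totalRowCount parameters are stand-ins for values available at the call site. This illustrates the intent, not the actual patch:
{noformat}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rex.RexBuilder;
import org.apache.calcite.rex.RexCall;
import org.apache.calcite.rex.RexInputRef;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.rex.RexUtil;

// Hypothetical helper: group conjuncts by the column they reference and do
// one histogram lookup per column, instead of multiplying independent
// per-predicate selectivities.
static double estimateSelectivityByColumn(RexNode filter, RexBuilder rexBuilder,
    Map<Integer, NumericEquiDepthHistogram> histograms, long totalRowCount) {
  Map<Integer, List<RexNode>> predicatesByColumn = new HashMap<>();
  for (RexNode conjunct : RelOptUtil.conjunctions(filter)) {
    if (conjunct instanceof RexCall) {
      for (RexNode operand : ((RexCall) conjunct).getOperands()) {
        if (operand instanceof RexInputRef) {
          // group under the first column (RexInputRef) the conjunct references
          predicatesByColumn
              .computeIfAbsent(((RexInputRef) operand).getIndex(), k -> new ArrayList<>())
              .add(conjunct);
          break;
        }
      }
    }
  }
  double selectivity = 1.0;
  for (Map.Entry<Integer, List<RexNode>> e : predicatesByColumn.entrySet()) {
    NumericEquiDepthHistogram h = histograms.get(e.getKey());
    if (h == null) {
      continue;  // no statistics for this column; default estimation elided
    }
    // e.g. a > 10 AND a <= 50 is estimated as the single range (10, 50]
    RexNode combined = RexUtil.composeConjunction(rexBuilder, e.getValue(), false);
    Double s = h.estimatedSelectivity(combined, totalRowCount);
    if (s != null) {
      selectivity *= s;
    }
  }
  return selectivity;
}
{noformat}
For item (2), a small worked example: if the table has 1000 rows of which 800 are non-null and covered by the histogram, and a range predicate matches 400 histogram rows, the selectivity should be 400/1000 = 0.4 (denominator totalRowCount), not 400/800 = 0.5.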



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
