wgtmac commented on code in PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#discussion_r1113856412


##########
parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java:
##########
@@ -289,8 +320,14 @@ public <T extends Comparable<T>> Boolean visit(Lt<T> lt) {
 
     T value = lt.getValue();
 
-    // drop if value <= min
-    return stats.compareMinToValue(value) >= 0;
+    // we are looking for records where v < someValue
+    if (stats.compareMinToValue(value) >= 0) {
+      // drop if value <= min
+      return BLOCK_CANNOT_MATCH;
+    } else {
+      // if value > min, we must take it
+      return BLOCK_MUST_MATCH;

Review Comment:
   I'm not aware of any implementation use lower/upper bounds instead of 
min/max values. But the specs does allow it to happen and `ColumnIndex` has 
stated about it 
[(link)](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L956)
 explicitly as mentioned by @gszadovszky 
   
   I prefer to be conservative and make it easy to use:
   - Regard statistics as lower/upper bounds unless `min == max`.
   - Do not add any config for an aggressive optimization.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to