wgtmac commented on code in PR #1023: URL: https://github.com/apache/parquet-mr/pull/1023#discussion_r1113856412
########## parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java: ########## @@ -289,8 +320,14 @@ public <T extends Comparable<T>> Boolean visit(Lt<T> lt) { T value = lt.getValue(); - // drop if value <= min - return stats.compareMinToValue(value) >= 0; + // we are looking for records where v < someValue + if (stats.compareMinToValue(value) >= 0) { + // drop if value <= min + return BLOCK_CANNOT_MATCH; + } else { + // if value > min, we must take it + return BLOCK_MUST_MATCH; Review Comment: I'm not aware of any implementation use lower/upper bounds instead of min/max values. But the specs does allow it to happen and `ColumnIndex` has stated about it [(link)](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L956) explicitly as mentioned by @gszadovszky I prefer to be conservative and make it easy to use: - Regard statistics as lower/upper bounds unless `min == max`. - Do not add any config for an aggressive optimization. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org