wgtmac commented on code in PR #1023: URL: https://github.com/apache/parquet-mr/pull/1023#discussion_r1096546533
########## parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java: ########## @@ -98,16 +99,19 @@ public List<BlockMetaData> visit(FilterCompat.FilterPredicateCompat filterPredic for (BlockMetaData block : blocks) { boolean drop = false; + // Whether one filter can exactly determine the existence/nonexistence of the value. + // If true then we can skip the remaining filters to save time and space. + AtomicBoolean canExactlyDetermine = new AtomicBoolean(false); Review Comment: Why atomic? ########## parquet-hadoop/src/test/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilterTest.java: ########## @@ -792,6 +793,16 @@ public void testInverseUdpMissingColumn() throws Exception { canDrop(LogicalInverseRewriter.rewrite(not(userDefined(fake, nullRejecting))), ccmd, dictionaries)); } + @Test + public void testCanSkipOtherFilters() { Review Comment: The test looks a little bit insufficient. More kinds of predicates and compound predicates need to be covered. Also test of `RowGroupFilter` is missing. ########## parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java: ########## @@ -559,4 +591,12 @@ private static boolean hasNonDictionaryPages(ColumnChunkMetaData meta) { return true; } } + + private <T extends Comparable<T>> void markCanExactlyDetermine(Set<T> dictSet) { + if (dictSet == null) { + canExactlyDetermine = false; Review Comment: It seems that `canExactlyDetermine` should use `OR` to update its value. Otherwise, any predicate with a null dict will set it to false even if previous predicates have marked it to true. Additionally, we may have a chance to shortcut the evaluation as well if any predicate has set it to true. ########## parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java: ########## @@ -98,16 +99,19 @@ public List<BlockMetaData> visit(FilterCompat.FilterPredicateCompat filterPredic for (BlockMetaData block : blocks) { boolean drop = false; + // Whether one filter can exactly determine the existence/nonexistence of the value. + // If true then we can skip the remaining filters to save time and space. + AtomicBoolean canExactlyDetermine = new AtomicBoolean(false); Review Comment: I'd suggest rename `canExactlyDetermine` to `preciselyDetermined`. Or even better, use an enum something like below ```java enum PredicateEvaluation { CAN_DROP, /* the block can be dropped for sure */ CANNOT_DROP, /* the block cannot be dropped for sure*/ MAY_DROP, /* cannot decide yet, may be dropped by other filter levels */ } ``` In this way, we can merge the the two boolean values here. The downside is that the code may need more refactoring to add the enum value to different filter classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org