jbimbert commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r203966145
##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
##########
@@ -124,8 +124,7 @@ private static LogicalExpression createIsTruePredicate(LogicalExpression expr) {
    */
   private static LogicalExpression createIsFalsePredicate(LogicalExpression expr) {
     return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) ->
-      //if min value is not false or if there are all nulls -> canDrop
-      isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
+      isAllNulls(exprStat, evaluator.getRowCount()) || exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin() ? RowsMatch.NONE : checkNull(exprStat)

Review comment:
Statistics of the test files:

blnTbl/0_0_1.parquet => ST:[min: false, max: false, num_nulls: 0] : 8 tests in testBooleanPredicate()
tfTbl/ft0.parquet => ST:[min: false, max: true, num_nulls: 0] : 4 tests in testBooleanPredicate()

example1:
select * from `ava-exec/src/test/resources/parquetFilterPush/blnTbl/0_0_1.parquet` where col_bln is false
returns (false, false, false)

example2:
select * from `java-exec/src/test/resources/parquetFilterPush/tfTbl/ft0.parquet` where a is true [resp. false]
returns true [resp. false]

Finally, when running the query
select * from dfs.tmp.`blnTbl` where col_bln is false
where blnTbl contains only 0_0_0.parquet (T,T,T) and 0_0_1.parquet (F,F,F), the physical plan reads:

00-00 Screen : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {9.3 rows, 12.3 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 523
00-01 Project(**=[$0]) : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {9.0 rows, 12.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 522
00-02 Project(**=[$0]) : rowType = RecordType(DYNAMIC_STAR **): rowcount = 3.0, cumulative cost = {6.0 rows, 9.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 521
00-03 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/blnTbl/0_0_1.parquet]], selectionRoot=file:/tmp/blnTbl, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]]) : rowType = RecordType(DYNAMIC_STAR **, ANY col_bln): rowcount = 3.0, cumulative cost = {3.0 rows, 6.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 520

The plan no longer contains a filter: the predicate returns NONE for 0_0_0.parquet, so that file is pruned from the scan, and ALL for 0_0_1.parquet, so every row in it already satisfies the condition.
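
To make the NONE / ALL reasoning above easier to follow, here is a minimal standalone sketch of how an "is false" check could classify a row group from its boolean statistics. It is not the code in ParquetIsPredicate: the class IsFalsePruningSketch, the BoolStats holder and the isFalse method are hypothetical names introduced for illustration, and the local RowsMatch enum only mirrors the three outcomes discussed in this review.

```java
// Illustrative sketch only: every name here is hypothetical, not Drill's API.
final class IsFalsePruningSketch {

  /** Outcome of checking a filter against one row group's statistics. */
  enum RowsMatch { ALL, NONE, SOME }

  /** Hypothetical holder for a boolean column's row-group statistics. */
  static final class BoolStats {
    final boolean min;    // minimum over non-null values (false < true)
    final boolean max;    // maximum over non-null values
    final long numNulls;  // number of null values in the row group
    final long rowCount;  // total number of rows in the row group

    BoolStats(boolean min, boolean max, long numNulls, long rowCount) {
      this.min = min;
      this.max = max;
      this.numNulls = numNulls;
      this.rowCount = rowCount;
    }
  }

  /** Classifies a row group for the predicate "col IS FALSE". */
  static RowsMatch isFalse(BoolStats s) {
    if (s.numNulls == s.rowCount) {
      return RowsMatch.NONE;  // only nulls: no row can be false
    }
    if (s.min) {
      return RowsMatch.NONE;  // min is true, so every non-null value is true
    }
    if (!s.max && s.numNulls == 0) {
      return RowsMatch.ALL;   // max is false and no nulls: every row is false
    }
    return RowsMatch.SOME;    // mixed values or some nulls: the filter must stay
  }
}
```

Under these assumptions, new BoolStats(false, false, 0, 3), the statistics of 0_0_1.parquet, classifies as ALL, while a (T,T,T) row group such as 0_0_0.parquet, i.e. new BoolStats(true, true, 0, 3), classifies as NONE. The tfTbl/ft0.parquet statistics (min: false, max: true) fall through to SOME, which is the case where the filter has to be kept.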