gszadovszky commented on code in PR #1328: URL: https://github.com/apache/parquet-java/pull/1328#discussion_r1613060962
########## parquet-column/src/main/java/org/apache/parquet/filter2/predicate/ContainsRewriter.java: ########## @@ -0,0 +1,185 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.parquet.filter2.predicate; + +import java.util.Objects; +import org.apache.parquet.filter2.predicate.FilterPredicate.Visitor; +import org.apache.parquet.filter2.predicate.Operators.And; +import org.apache.parquet.filter2.predicate.Operators.Contains; +import org.apache.parquet.filter2.predicate.Operators.Eq; +import org.apache.parquet.filter2.predicate.Operators.Gt; +import org.apache.parquet.filter2.predicate.Operators.GtEq; +import org.apache.parquet.filter2.predicate.Operators.In; +import org.apache.parquet.filter2.predicate.Operators.LogicalNotUserDefined; +import org.apache.parquet.filter2.predicate.Operators.Lt; +import org.apache.parquet.filter2.predicate.Operators.LtEq; +import org.apache.parquet.filter2.predicate.Operators.Not; +import org.apache.parquet.filter2.predicate.Operators.NotEq; +import org.apache.parquet.filter2.predicate.Operators.NotIn; +import org.apache.parquet.filter2.predicate.Operators.Or; +import org.apache.parquet.filter2.predicate.Operators.UserDefined; + +/** + * Recursively rewrites Contains predicates composed using And or Or into a single Contains predicate + * containing all predicate assertions. + * + * This is a performance optimization, as all composed Contains sub-predicates must share the same column, and + * can therefore can be applied efficiently as a single predicate pass. Review Comment: nit: extra "can" (not sure which one :smile:) ########## parquet-column/src/test/java/org/apache/parquet/filter2/predicate/TestFilterApiMethods.java: ########## @@ -91,13 +93,46 @@ public void testFilterPredicateCreation() { assertEquals(ColumnPath.get("x", "y", "z"), ((Gt) gt).getColumn().getColumnPath()); } + @Test + public void testInvalidContainsCreation() { + FilterPredicate pred; + try { + pred = contains(eq(binColumn, null)); + fail("Contains predicate with null element value should fail"); + } catch (IllegalArgumentException e) { + assertEquals("Contains predicate does not support null element value", e.getMessage()); + } Review Comment: nit: you may use [Assert.assertThrows](https://junit.org/junit4/javadoc/4.13/org/junit/Assert.html#assertThrows(java.lang.String,%20java.lang.Class,%20org.junit.function.ThrowingRunnable)) ########## parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java: ########## @@ -54,6 +56,12 @@ public class PhoneBookWriter { + " optional binary kind (UTF8);\n" + " }\n" + " }\n" + + " optional group accounts {\n" Review Comment: nit: Let's use the `MAP` logical type according to the [spec](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps). ########## parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java: ########## @@ -368,6 +369,102 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(NotIn<T> notIn) { return IndexIterator.all(getPageCount()); } + @Override + public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> contains) { + return contains.filter( + this, + // For Contains.And, compute the intersection of the int ranges + (l, r) -> new PrimitiveIterator.OfInt() { + private Integer next = null; + + @Override + public int nextInt() { + int result = next; + next = null; + return result; + } Review Comment: I don't think this implementation follows the contract of an iterator. Iterators do not enforce the user to call `hasNext` before calling `next`. If you sure about the size, you may just call one `next` after another. It may throw a `NoSuchElementException` if there are no more values. The trick is to calculate the next value beforehand and check the existence of that value at `hasNext` and calculate the "next-next" value at `next` before returning the pre-calculated next value. (Hope it makes sense :smile:) ########## parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java: ########## @@ -257,6 +258,10 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N return new NotIn<>(column, values); } + public static <T extends Comparable<T>> Contains<T> contains(Eq<T> pred) { Review Comment: Sounds good, but keep in mind that this one and the follow ups shall be released together otherwise we'll have API compatibility issues. ########## parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java: ########## @@ -368,6 +369,102 @@ public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(NotIn<T> notIn) { return IndexIterator.all(getPageCount()); } + @Override + public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(Contains<T> contains) { + return contains.filter( + this, + // For Contains.And, compute the intersection of the int ranges + (l, r) -> new PrimitiveIterator.OfInt() { + private Integer next = null; + + @Override + public int nextInt() { + int result = next; + next = null; + return result; + } Review Comment: See [IndexIterator](https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/IndexIterator.java) as an example. You may even put static methods there for `intersection` and `union` and it will be more readable here. I would not use boxing/unboxing here only to have a additional `null` value. Indices here can only be non-negative so you may use `-1` to mark the case of no more values. ########## parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/TestRecordLevelFilters.java: ########## @@ -20,6 +20,7 @@ import static org.apache.parquet.filter2.predicate.FilterApi.and; import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn; +import static org.apache.parquet.filter2.predicate.FilterApi.contains; Review Comment: During checking the code I've found that this record level testing class does not seem to be correct. In several places (just like in your code) the method `PhoneBookWriter.readFile` is used which calls [PhoneBookWriter.createReader](https://github.com/apache/parquet-java/blob/240ab0daa005e9cd9dcc9d3df716cd321761b307/parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java#L335-L339) to create the reader instance. The issue is, by default all the filtering (statistics/dictionary/column index) are working. As a result, we cannot be sure that the read results we are getting are really filtered only by the record level filter. I know this is not your fault but could you please try using a properly set up reader for this test? ```java // ... Configuration conf = new Configuration(); GroupWriteSupport.setSchema(schema, conf); return ParquetReader.builder(new GroupReadSupport(), file) .withConf(conf) .withFilter(filter) .withAllocator(allocator) .useBloomFilter(false) .useDictionaryFilter(false) .useStatsFilter(false) .useColumnIndexFilter(false) .build(); ``` If unrelated errors may occur, let's handle this separately, but I want to avoid not catching a potential issue in your code by leaving this test as is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
