[jira] [Commented] (PARQUET-2237) Improve performance when filters in RowGroupFilter can match exactly

ASF GitHub Bot (Jira) Fri, 17 Feb 2023 00:50:08 -0800


    [ https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690258#comment-17690258 ]


ASF GitHub Bot commented on PARQUET-2237:
-----------------------------------------

yabola commented on code in PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#discussion_r1109441802


##########
parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/PredicateEvaluation.java:
##########
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.filter2.compat;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.parquet.filter2.predicate.FilterPredicate;
+import org.apache.parquet.filter2.predicate.Operators;
+
+/**
+ * Used in Filters to mark whether we should DROP the block if data matches the condition.
+ * If we cannot decide whether the block matches, it will be always safe to return BLOCK_MIGHT_MATCH.
+ * We use Boolean Object here to distinguish the value type, please do not modify it.
+ */
+public class PredicateEvaluation {
+  /* The block might match, but we cannot decide yet, will check in the other filters. */
+  public static final Boolean BLOCK_MIGHT_MATCH = new Boolean(false);
+  /* The block can match for sure. */
+  public static final Boolean BLOCK_MUST_MATCH = new Boolean(false);
+  /* The block can't match for sure */
+  public static final Boolean BLOCK_CANNOT_MATCH = new Boolean(true);
+
+  public static Boolean evaluateAnd(Operators.And and, FilterPredicate.Visitor<Boolean> predicate) {
+    Boolean left = and.getLeft().accept(predicate);
+    if (left == BLOCK_CANNOT_MATCH) {
+      // seems unintuitive to put an || not an && here but we can
+      // drop a chunk of records if we know that either the left or
+      // the right predicate agrees that no matter what we don't
+      // need this chunk.
+      return BLOCK_CANNOT_MATCH;
+    }
+    Boolean right = and.getRight().accept(predicate);
+    if (right == BLOCK_CANNOT_MATCH) {
+      return BLOCK_CANNOT_MATCH;
+    } else if (left == BLOCK_MUST_MATCH && right == BLOCK_MUST_MATCH) {

Review Comment:
   If left is `BLOCK_MUST_MATCH` and right is `BLOCK_MIGHT_MATCH`, then left & right 
should be `BLOCK_MIGHT_MATCH`, because in a later filter right may turn out to be 
`BLOCK_CANNOT_MATCH`, in which case we should drop the block.
   
   I also added a new [UT](https://github.com/apache/parquet-mr/pull/1023/files#diff-8915e6fa23018e02c2e79a3f6cc5078a8882f8031022dbdde217fe9bf1d908afR143).
   In `StatisticsFilter`, left might match (but can't match in `DictionaryFilter`) 
and right must match, so the result is might match in `StatisticsFilter` and 
can't match in `DictionaryFilter`.
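
   The combination rule for `And` described in this comment can be sketched as a small three-valued table. This is only an illustration under stated assumptions: the enum `BlockMatch` and the class `AndSketch` are hypothetical names, whereas the PR itself uses three `Boolean` sentinel objects compared by identity.

```java
// Hypothetical sketch, not the code in PR #1023: an enum stands in for the
// three Boolean sentinels of PredicateEvaluation.
final class AndSketch {
  enum BlockMatch { CANNOT_MATCH, MIGHT_MATCH, MUST_MATCH }

  /** Combines the verdicts of the two operands of an AND predicate. */
  static BlockMatch and(BlockMatch left, BlockMatch right) {
    // If either side can never match, the conjunction can never match.
    if (left == BlockMatch.CANNOT_MATCH || right == BlockMatch.CANNOT_MATCH) {
      return BlockMatch.CANNOT_MATCH;
    }
    // Only when both sides are certain is the conjunction certain.
    if (left == BlockMatch.MUST_MATCH && right == BlockMatch.MUST_MATCH) {
      return BlockMatch.MUST_MATCH;
    }
    // Otherwise stay undecided so that a later filter can still drop the block.
    return BlockMatch.MIGHT_MATCH;
  }
}
```

   With this rule, `and(MUST_MATCH, MIGHT_MATCH)` yields `MIGHT_MATCH`, so a later filter that reports `CANNOT_MATCH` for the right-hand side can still drop the block.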
   
   





> Improve performance when filters in RowGroupFilter can match exactly
> --------------------------------------------------------------------
>
>                 Key: PARQUET-2237
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2237
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Priority: Major
>
> If we can decide exactly from the min/max statistics, we no longer need to load the 
> dictionary from the filesystem and compare values one by one.
> Similarly, the BloomFilter has to be loaded from the filesystem, which may cost time and 
> memory. If we can determine exactly whether the value exists 
> from the min/max or dictionary filters, we can skip the BloomFilter to 
> improve performance.
> For example,
>  # When reading data greater than {{x1}} in the block, if the min/max statistics show 
> that all values are greater than {{x1}}, we don't need to read the dictionary and compare 
> one by one.
>  # If we already have page dictionaries and have compared one by one, we 
> don't need to read the BloomFilter and compare.
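
The short-circuiting described above can be sketched as a chain of filter stages ordered from cheapest to most expensive, where evaluation stops at the first definite verdict. This is a minimal illustration, not the actual `RowGroupFilter` code: `FilterChainSketch`, `Verdict` and `shouldDrop` are hypothetical names, and each stage is modelled as a plain `Supplier`.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of the idea in PARQUET-2237, not parquet-mr code.
final class FilterChainSketch {
  enum Verdict { CANNOT_MATCH, MIGHT_MATCH, MUST_MATCH }

  /**
   * Runs the stages in order (for example statistics, then dictionary, then
   * bloom filter) and returns true if the row group should be dropped.
   * Stages after the first definite verdict are never evaluated, so their
   * data does not have to be loaded.
   */
  static boolean shouldDrop(List<Supplier<Verdict>> stages) {
    for (Supplier<Verdict> stage : stages) {
      Verdict verdict = stage.get();
      if (verdict == Verdict.CANNOT_MATCH) {
        return true;   // definitely no matching rows: drop the row group
      }
      if (verdict == Verdict.MUST_MATCH) {
        return false;  // definitely matching rows: keep it, skip later stages
      }
      // MIGHT_MATCH: undecided, fall through to the next stage.
    }
    return false;      // still undecided after all stages: keeping is the safe choice
  }
}
```

In this sketch, a statistics stage that returns `MUST_MATCH` means the dictionary and bloom filter stages are never invoked.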



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
