[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17422814#comment-17422814
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

gszadovszky commented on pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#issuecomment-931370971


   Thank you for your contribution, @huaxingao! Great work!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Assignee: Huaxin Gao
>Priority: Major
>
> FilterApi should support native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17422810#comment-17422810
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#issuecomment-931369233


   @gszadovszky @shangxinli @viirya @dbtsai Thank you so much for all your help!!




[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17422575#comment-17422575
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

gszadovszky merged pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923


   




[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421770#comment-17421770
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r717696646



##
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##
@@ -18,25 +18,17 @@
  */
 package org.apache.parquet.column;
 
-import java.util.Iterator;
-import java.util.Set;
-
-import org.apache.parquet.io.api.Binary;
 import org.apache.parquet.schema.PrimitiveComparator;
 
 /**
- * This class calculates the max and min values of a Set.
+ * This class calculates the max and min values of an iterable collection.
  */
 public final class MinMax {
-  private PrimitiveComparator comparator;
-  private Iterator iterator;
   private T min = null;
   private T max = null;
 
-  public MinMax(PrimitiveComparator comparator, Iterator iterator) {
-    this.comparator = comparator;
-    this.iterator = iterator;
-    getMinAndMax();
+  public MinMax(PrimitiveComparator comparator, Iterable iterable) {

Review comment:
   Fixed. Thanks!

##
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##
@@ -47,43 +39,22 @@ public T getMax() {
     return max;
   }
 
-  private void getMinAndMax() {
-    while(iterator.hasNext())  {
-      T element = iterator.next();
+  private void getMinAndMax(PrimitiveComparator comparator, Iterable iterable) {
+    iterable.forEach(element -> {
       if (max == null) {
         max = element;
-      } else if (max != null && element != null) {
-        if ((element instanceof Integer &&
-          ((PrimitiveComparator)comparator).compare((Integer)max, (Integer)element) < 0) ||
-          (element instanceof Binary &&
-          ((PrimitiveComparator)comparator).compare((Binary)max, (Binary)element) < 0) ||
-          (element instanceof Double &&
-          ((PrimitiveComparator)comparator).compare((Double)max, (Double)element) < 0) ||
-          (element instanceof Float &&
-          ((PrimitiveComparator)comparator).compare((Float)max, (Float)element) < 0) ||
-          (element instanceof Boolean &&
-          ((PrimitiveComparator)comparator).compare((Boolean)max, (Boolean)element) < 0) ||
-          (element instanceof Long &&
-          ((PrimitiveComparator)comparator).compare((Long) max, (Long)element) < 0))
+      } else if (element != null) {
+        if (comparator.compare(max, element) < 0) {

Review comment:
   Done.






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-28 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421642#comment-17421642
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r717354327



##
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##
@@ -18,25 +18,17 @@
  */
 package org.apache.parquet.column;
 
-import java.util.Iterator;
-import java.util.Set;
-
-import org.apache.parquet.io.api.Binary;
 import org.apache.parquet.schema.PrimitiveComparator;
 
 /**
- * This class calculates the max and min values of a Set.
+ * This class calculates the max and min values of an iterable collection.
  */
 public final class MinMax {
-  private PrimitiveComparator comparator;
-  private Iterator iterator;
   private T min = null;
   private T max = null;
 
-  public MinMax(PrimitiveComparator comparator, Iterator iterator) {
-    this.comparator = comparator;
-    this.iterator = iterator;
-    getMinAndMax();
+  public MinMax(PrimitiveComparator comparator, Iterable iterable) {

Review comment:
   I think you should get warnings if you use a generic class without the generic type. `PrimitiveComparator<T>` or even `Comparator<T>` should work fine.
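
The point about raw types can be illustrated with a minimal sketch. It is not the parquet-mr source: `MinMaxSketch` is a hypothetical stand-in for the class under review, it uses `java.util.Comparator<T>` in place of `PrimitiveComparator<T>`, and it skips null elements entirely, which is an assumption rather than the behavior of the reviewed code. With both parameters fully parameterized, the compiler has no raw-type or unchecked warnings to report and the comparison needs no casts.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the class under review; not the parquet-mr source.
final class MinMaxSketch<T> {
  private T min = null;
  private T max = null;

  // Both arguments carry the type parameter, so no raw types are involved.
  MinMaxSketch(Comparator<T> comparator, Iterable<T> iterable) {
    for (T element : iterable) {
      if (element == null) {
        continue; // assumption of this sketch: nulls are ignored
      }
      // Null check and comparison combined in a single condition.
      if (max == null || comparator.compare(max, element) < 0) {
        max = element;
      }
      if (min == null || comparator.compare(min, element) > 0) {
        min = element;
      }
    }
  }

  T getMin() { return min; }

  T getMax() { return max; }

  public static void main(String[] args) {
    List<Integer> values = Arrays.asList(7, null, 3, 9);
    MinMaxSketch<Integer> mm =
        new MinMaxSketch<>(Comparator.<Integer>naturalOrder(), values);
    System.out.println(mm.getMin() + " " + mm.getMax());
  }
}
```

Declaring the parameter as `Comparator<T>` rather than a more specific subtype also means any comparator implementation can be passed in, which is what the "or even `Comparator`" remark points at.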

##
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##
@@ -47,43 +39,22 @@ public T getMax() {
     return max;
   }
 
-  private void getMinAndMax() {
-    while(iterator.hasNext())  {
-      T element = iterator.next();
+  private void getMinAndMax(PrimitiveComparator comparator, Iterable iterable) {
+    iterable.forEach(element -> {
       if (max == null) {
         max = element;
-      } else if (max != null && element != null) {
-        if ((element instanceof Integer &&
-          ((PrimitiveComparator)comparator).compare((Integer)max, (Integer)element) < 0) ||
-          (element instanceof Binary &&
-          ((PrimitiveComparator)comparator).compare((Binary)max, (Binary)element) < 0) ||
-          (element instanceof Double &&
-          ((PrimitiveComparator)comparator).compare((Double)max, (Double)element) < 0) ||
-          (element instanceof Float &&
-          ((PrimitiveComparator)comparator).compare((Float)max, (Float)element) < 0) ||
-          (element instanceof Boolean &&
-          ((PrimitiveComparator)comparator).compare((Boolean)max, (Boolean)element) < 0) ||
-          (element instanceof Long &&
-          ((PrimitiveComparator)comparator).compare((Long) max, (Long)element) < 0))
+      } else if (element != null) {
+        if (comparator.compare(max, element) < 0) {

Review comment:
   nit: You may combine the two with an `&&`.

##
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##
@@ -47,43 +39,22 @@ public T getMax() {
     return max;
   }
 
-  private void getMinAndMax() {
-    while(iterator.hasNext())  {
-      T element = iterator.next();
+  private void getMinAndMax(PrimitiveComparator comparator, Iterable iterable) {
+    iterable.forEach(element -> {
       if (max == null) {
         max = element;
-      } else if (max != null && element != null) {
-        if ((element instanceof Integer &&
-          ((PrimitiveComparator)comparator).compare((Integer)max, (Integer)element) < 0) ||
-          (element instanceof Binary &&
-          ((PrimitiveComparator)comparator).compare((Binary)max, (Binary)element) < 0) ||
-          (element instanceof Double &&
-          ((PrimitiveComparator)comparator).compare((Double)max, (Double)element) < 0) ||
-          (element instanceof Float &&
-          ((PrimitiveComparator)comparator).compare((Float)max, (Float)element) < 0) ||
-          (element instanceof Boolean &&
-          ((PrimitiveComparator)comparator).compare((Boolean)max, (Boolean)element) < 0) ||
-          (element instanceof Long &&
-          ((PrimitiveComparator)comparator).compare((Long) max, (Long)element) < 0))
+      } else if (element != null) {
+        if (comparator.compare(max, element) < 0) {
           max = element;
+        }
       }
       if (min == null) {
         min = element;
-      } else if (min != null && element != null) {
-        if ((element instanceof Integer &&
-          ((PrimitiveComparator)comparator).compare((Integer)min, (Integer)element) > 0) ||
-          (element instanceof Binary &&
-          ((PrimitiveComparator)comparator).compare((Binary)min, (Binary)element) > 0) ||
-          (element instanceof Double &&
-          ((PrimitiveComparator)comparator).compare((Double)min, (Double)element) > 0) ||
-          (element instanceof Float &&
-          ((PrimitiveComparator)comparator).compare((Float)min, (Float)element) > 0) ||
-          (element instanceof Boolean &&
-

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420577#comment-17420577
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r716474012



##
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column;
+
+import java.util.Iterator;
+import java.util.Set;
+
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.PrimitiveComparator;
+
+/**
+ * This class calculates the max and min values of a Set.
+ */
+public final class MinMax {
+  private PrimitiveComparator comparator;
+  private Iterator iterator;
+  private T min = null;
+  private T max = null;
+
+  public MinMax(PrimitiveComparator comparator, Iterator iterator) {

Review comment:
   I would expect an `Iterable` instead of an `Iterator`. This way the Set can be passed directly.
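
A short sketch of why an `Iterable` parameter is more convenient than an `Iterator` one. The class and method names below (`IterableVsIteratorSketch`, `maxOf`) are hypothetical and only serve the illustration; nothing here is taken from the pull request. A `Set` already implements `Iterable`, so the caller does not have to call `iterator()` first, and the same collection can be traversed again, whereas an `Iterator` is consumed after one pass.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class IterableVsIteratorSketch {
  // Accepting Iterable lets callers hand over any collection as-is.
  static <T> T maxOf(Comparator<T> comparator, Iterable<T> values) {
    T max = null;
    for (T value : values) {
      if (value != null && (max == null || comparator.compare(max, value) < 0)) {
        max = value;
      }
    }
    return max;
  }

  public static void main(String[] args) {
    Set<String> names = new HashSet<>();
    names.add("thing1");
    names.add("thing2");
    Comparator<String> order = Comparator.naturalOrder();

    // The Set is passed directly; no names.iterator() call is needed.
    System.out.println(maxOf(order, names));
    // A second traversal of the same Set works, unlike with a used Iterator.
    System.out.println(maxOf(order.reversed(), names));
  }
}
```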

##
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column;
+
+import java.util.Iterator;
+import java.util.Set;
+
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.PrimitiveComparator;
+
+/**
+ * This class calculates the max and min values of a Set.
+ */
+public final class MinMax {
+  private PrimitiveComparator comparator;
+  private Iterator iterator;
+  private T min = null;
+  private T max = null;
+
+  public MinMax(PrimitiveComparator comparator, Iterator iterator) {
+    this.comparator = comparator;
+    this.iterator = iterator;
+    getMinAndMax();
+  }
+
+  public T getMin() {
+    return min;
+  }
+
+  public T getMax() {
+    return max;
+  }
+
+  private void getMinAndMax() {
+    while(iterator.hasNext())  {
+      T element = iterator.next();
+      if (max == null) {
+        max = element;
+      } else if (max != null && element != null) {

Review comment:
   You are already in the else path so do not need to check for `max != null`.

##
File path: parquet-column/src/main/java/org/apache/parquet/column/MinMax.java
##
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.column;
+
+import java.util.Iterator;
+import java.util.Set;
+
+import org.apache.parquet.io.api.Binary;
+import org.apache.parquet.schema.PrimitiveComparator;
+
+/**
+ * This 

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420408#comment-17420408
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r716278171



##
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##
@@ -326,12 +323,27 @@ boolean isNullPage(int pageIndex) {
       // and >= the min value in the IN set, then the page might contain
       // the values in the IN set.
       getBoundaryOrder().ltEq(createValueComparator(max))
-        .forEachRemaining((int index) -> matchingIndexes2.add(index));
+        .forEachRemaining((int index) -> matchingIndexesLessThanMax.add(index));
       getBoundaryOrder().gtEq(createValueComparator(min))
-        .forEachRemaining((int index) -> matchingIndexes3.add(index));
-      matchingIndexes2.retainAll(matchingIndexes3);
-      matchingIndexes2.addAll(matchingIndexes1);  // add the matching null pages
-      return IndexIterator.filter(getPageCount(), pageIndex -> matchingIndexes2.contains(pageIndex));
+        .forEachRemaining((int index) -> matchingIndexesLargerThanMin.add(index));
+      matchingIndexesLessThanMax.retainAll(matchingIndexesLargerThanMin);
+      IntSet matchingIndex = matchingIndexesLessThanMax;
+      matchingIndex.addAll(matchingIndexesForNull);  // add the matching null pages
+      return IndexIterator.filter(getPageCount(), pageIndex -> matchingIndex.contains(pageIndex));
+    }
+
+    private <T extends Comparable<T>> T getMaxOrMin(boolean isMax, Set<T> values) {
+      T res = null;
+      for (T element : values) {
+        if (res == null) {
+          res = element;
+        } else if (isMax && res != null && element != null && res.compareTo(element) < 0) {

Review comment:
   I changed to `PrimitiveComparator`. I checked `instanceof` and then cast to use the `PrimitiveComparator` for each of the types. I am not sure if this is the correct way. Please take a look.
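
The page selection in the `ColumnIndexBuilder` hunk above can be sketched in plain Java. This is a simplified model under stated assumptions, not the parquet-mr implementation: pages are described by hypothetical `pageMins`, `pageMaxs` and `nullCounts` arrays, values are `Integer`, and the boundary-order lookups are replaced by a linear scan. A page is kept when its minimum is at or below the largest IN value and its maximum is at or above the smallest IN value; pages containing nulls are added when the IN set includes null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simplified model of IN-predicate page selection; all names are hypothetical.
final class InPageSelectionSketch {
  // pageMins[i] and pageMaxs[i] are the bounds of page i (null for an all-null page).
  static List<Integer> matchingPages(Integer[] pageMins, Integer[] pageMaxs, long[] nullCounts,
                                     int inMin, int inMax, boolean inHasNull) {
    Set<Integer> lessThanMax = new TreeSet<>();
    Set<Integer> largerThanMin = new TreeSet<>();
    for (int i = 0; i < pageMins.length; i++) {
      if (pageMins[i] != null && pageMins[i] <= inMax) {
        lessThanMax.add(i); // page may hold values <= max of the IN set
      }
      if (pageMaxs[i] != null && pageMaxs[i] >= inMin) {
        largerThanMin.add(i); // page may hold values >= min of the IN set
      }
    }
    lessThanMax.retainAll(largerThanMin); // intersection: ranges overlap
    if (inHasNull) {
      for (int i = 0; i < nullCounts.length; i++) {
        if (nullCounts[i] > 0) {
          lessThanMax.add(i); // add the matching null pages
        }
      }
    }
    return new ArrayList<>(lessThanMax);
  }

  public static void main(String[] args) {
    Integer[] mins = {1, 10, 30, null};
    Integer[] maxs = {5, 20, 40, null};
    long[] nulls = {0, 0, 0, 4};
    // IN (12, 18, null): smallest value 12, largest value 18
    System.out.println(matchingPages(mins, maxs, nulls, 12, 18, true));
  }
}
```

The result is conservative by design: a page whose range overlaps the span of the IN values is kept even if none of the individual values falls inside it.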

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java
##
@@ -186,26 +186,36 @@ private boolean hasNulls(ColumnChunkMetaData column) {
       return BLOCK_MIGHT_MATCH;
     }
 
+    if (stats.isNumNullsSet()) {
+      if (stats.getNumNulls() == 0) {
+        if (values.contains(null) && values.size() == 1) return BLOCK_CANNOT_MATCH;
+      } else {
+        if (values.contains(null)) return BLOCK_MIGHT_MATCH;
+      }
+    }
+
     // drop if all the element in value < min || all the element in value > max
-    return allElementCanBeDropped(stats, values, meta);
+    if (stats.compareMinToValue(getMaxOrMin(true, values)) <= 0 &&
+      stats.compareMaxToValue(getMaxOrMin(false, values)) >= 0) {
+      return BLOCK_MIGHT_MATCH;
+    }
+    else {
+      return BLOCK_CANNOT_MATCH;
+    }
   }
 
-  private <T extends Comparable<T>> Boolean allElementCanBeDropped(Statistics stats, Set<T> values, ColumnChunkMetaData meta) {
-    for (T value : values) {
-      if (value != null) {
-        if (stats.compareMinToValue(value) <= 0 && stats.compareMaxToValue(value) >= 0)
-          return false;
-      } else {
-        // numNulls is not set. We don't know anything about the nulls in this chunk
-        if (!stats.isNumNullsSet()) {
-          return false;
-        }
-        if (hasNulls(meta)) {
-          return false;
-        }
+  private <T extends Comparable<T>> T getMaxOrMin(boolean isMax, Set<T> values) {

Review comment:
   Done. Thanks!
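
The block-level check in the `StatisticsFilter` hunk above reduces to a range-overlap test, sketched here with plain integers. The names `InBlockCheckSketch` and `mightMatch` are hypothetical, the sketch assumes a non-empty IN set without nulls, and it stands in for the `compareMinToValue` / `compareMaxToValue` calls with direct comparisons. A block can be dropped only when every IN value lies below the block minimum or every IN value lies above the block maximum.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

// Hypothetical illustration of the min/max overlap test; not the parquet-mr code.
final class InBlockCheckSketch {
  // Assumes inValues is non-empty and contains no nulls.
  static boolean mightMatch(int statsMin, int statsMax, Collection<Integer> inValues) {
    int inMin = Collections.min(inValues);
    int inMax = Collections.max(inValues);
    // Keep the block when [statsMin, statsMax] overlaps [inMin, inMax].
    return statsMin <= inMax && statsMax >= inMin;
  }

  public static void main(String[] args) {
    System.out.println(mightMatch(40, 60, Arrays.asList(50)));     // value inside the block range
    System.out.println(mightMatch(40, 60, Arrays.asList(1, 2)));   // all values below the minimum
    System.out.println(mightMatch(40, 60, Arrays.asList(70, 80))); // all values above the maximum
    System.out.println(mightMatch(40, 60, Arrays.asList(1, 100))); // span overlaps, no value inside
  }
}
```

The last call shows the trade-off against the per-value loop that the hunk removes: comparing only the smallest and largest IN values needs two comparisons per block, but it keeps a block such as `[40, 60]` for the set `{1, 100}` even though neither value can occur in it.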

##
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##
@@ -291,32 +291,29 @@ boolean isNullPage(int pageIndex) {
     @Override
     public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
       Set<T> values = in.getValues();
-      TreeSet<T> myTreeSet = new TreeSet<>();
-      IntSet matchingIndexes1 = new IntOpenHashSet();  // for null
+      IntSet matchingIndexesForNull = new IntOpenHashSet();  // for null
       Iterator<T> it = values.iterator();
       while(it.hasNext()) {
         T value = it.next();
-        if (value != null) {
-          myTreeSet.add(value);
-        } else {
+        if (value == null) {
           if (nullCounts == null) {
             // Searching for nulls so if we don't have null related statistics we have to return all pages
             return IndexIterator.all(getPageCount());
           } else {
             for (int i = 0; i < nullCounts.length; i++) {
               if (nullCounts[i] > 0) {
-                matchingIndexes1.add(i);
+                matchingIndexesForNull.add(i);
               }
             }
           }
         }
       }
 
-      IntSet matchingIndexes2 = new IntOpenHashSet();
-      IntSet matchingIndexes3 = new IntOpenHashSet();
+      IntSet matchingIndexesLessThanMax = new IntOpenHashSet();
+  

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420409#comment-17420409
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r716278313



##
File path: 
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/TestRecordLevelFilters.java
##
@@ -146,6 +147,33 @@ public void testAllFilter() throws Exception {
     assertEquals(new ArrayList(), found);
   }
 
+  @Test
+  public void testInFilter() throws Exception {
+    BinaryColumn name = binaryColumn("name");
+
+    HashSet nameSet = new HashSet<>();
+    nameSet.add(Binary.fromString("thing2"));
+    nameSet.add(Binary.fromString("thing1"));
+    for (int i = 100; i < 200; i++) {
+      nameSet.add(Binary.fromString("p" + i));
+    }
+    FilterPredicate pred = in(name, nameSet);
+    List found = PhoneBookWriter.readFile(phonebookFile, FilterCompat.get(pred));
+
+    List expectedNames = new ArrayList<>();
+    expectedNames.add("thing1");
+    expectedNames.add("thing2");
+    for (int i = 100; i < 200; i++) {
+      expectedNames.add("p" + i);
+    }
+
+    assertEquals(expectedNames.get(0), ((Group)(found.get(0))).getString("name", 0));
+    assertEquals(expectedNames.get(1), ((Group)(found.get(1))).getString("name", 0));
+    for (int i = 2; i < 102; i++) {
+      assertEquals(expectedNames.get(i), ((Group)(found.get(i))).getString("name", 0));
+    }

Review comment:
   I added `assert(found.size() == 102)`. Since I have already checked that `found` contains `"thing1"`, `"thing2"` and `"p100"` through `"p199"`, I think this size assertion is sufficient to check that `found` doesn't contain anything else.
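
The reasoning in this comment (element-wise checks on the first N entries plus a size check pin the result down exactly) can be sketched as a small helper. `ExactMatchCheckSketch` and `assertExactly` are hypothetical names, and plain strings stand in for the `Group` records of the real test. The helper throws `AssertionError` explicitly instead of using the `assert` statement, because Java evaluates `assert` only when assertions are enabled (for example with `-ea`); whether that applies to a given test run depends on the build configuration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating "check each element, then check the size".
final class ExactMatchCheckSketch {
  static void assertExactly(List<String> expected, List<String> found) {
    // Every expected entry must be present at the same position.
    for (int i = 0; i < expected.size(); i++) {
      if (i >= found.size() || !expected.get(i).equals(found.get(i))) {
        throw new AssertionError("mismatch at index " + i);
      }
    }
    // The size check rules out any additional, unexpected entries.
    if (found.size() != expected.size()) {
      throw new AssertionError("expected " + expected.size() + " entries, found " + found.size());
    }
  }

  public static void main(String[] args) {
    List<String> expected = new ArrayList<>();
    expected.add("thing1");
    expected.add("thing2");
    for (int i = 100; i < 200; i++) {
      expected.add("p" + i);
    }
    assertExactly(expected, new ArrayList<>(expected));
    System.out.println("checked " + expected.size() + " entries");
  }
}
```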






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420407#comment-17420407
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r716278127



##
File path: 
parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##
@@ -1,14 +1,14 @@
-/* 

Review comment:
   Added. Thanks!






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419350#comment-17419350
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r714956184



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java
##
@@ -186,26 +186,36 @@ private boolean hasNulls(ColumnChunkMetaData column) {
   return BLOCK_MIGHT_MATCH;
 }
 
+if (stats.isNumNullsSet()) {
+  if (stats.getNumNulls() == 0) {
+if (values.contains(null) && values.size() == 1) return 
BLOCK_CANNOT_MATCH;
+  } else {
+if (values.contains(null)) return BLOCK_MIGHT_MATCH;
+  }
+}
+
 // drop if all the element in value < min || all the element in value > max
-return allElementCanBeDropped(stats, values, meta);
+if (stats.compareMinToValue(getMaxOrMin(true, values)) <= 0 &&
+  stats.compareMaxToValue(getMaxOrMin(false, values)) >= 0) {
+  return BLOCK_MIGHT_MATCH;
+}
+else {
+  return BLOCK_CANNOT_MATCH;
+}
   }
 
-  private <T extends Comparable<T>> Boolean 
allElementCanBeDropped(Statistics stats, Set values, ColumnChunkMetaData 
meta) {
-for (T value : values) {
-  if (value != null) {
-if (stats.compareMinToValue(value) <= 0 && 
stats.compareMaxToValue(value) >= 0)
-  return false;
-  } else {
-// numNulls is not set. We don't know anything about the nulls in this 
chunk
-if (!stats.isNumNullsSet()) {
-  return false;
-}
-if (hasNulls(meta)) {
-  return false;
-}
+  private <T extends Comparable<T>> T getMaxOrMin(boolean isMax, Set<T> 
values) {

Review comment:
   I don't really like having a boolean flag for min/max. I also think it 
would be faster to find the min and max values in the same iteration. 
It would also be nice not to have to copy-paste this method twice. 
What do you think about the following design?
   
   Have a separate class, e.g. `MinMax`, that has two `T` fields: `min` and 
`max`. An instance is created by passing an `Iterable` and a 
`PrimitiveComparator`. `min` and `max` are computed at instantiation, so 
both can be retrieved directly after construction.
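
A minimal sketch of the `MinMax` idea described above, not the code that ended
up in the PR. It assumes `java.util.Comparator` as a stand-in for Parquet's
`PrimitiveComparator`; the accessor names `getMin`/`getMax` and the choice to
skip null elements are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: java.util.Comparator stands in for PrimitiveComparator.
final class MinMax<T> {
  private final T min;
  private final T max;

  // Finds both extremes in a single pass; null elements are skipped (assumption).
  MinMax(Iterable<T> values, Comparator<T> comparator) {
    T lo = null;
    T hi = null;
    for (T value : values) {
      if (value == null) {
        continue;
      }
      if (lo == null || comparator.compare(value, lo) < 0) {
        lo = value;
      }
      if (hi == null || comparator.compare(value, hi) > 0) {
        hi = value;
      }
    }
    this.min = lo;
    this.max = hi;
  }

  T getMin() {
    return min;
  }

  T getMax() {
    return max;
  }
}

final class MinMaxDemo {
  public static void main(String[] args) {
    List<Integer> values = Arrays.asList(7, null, 3, 20, 5);
    MinMax<Integer> minMax = new MinMax<>(values, Comparator.<Integer>naturalOrder());
    System.out.println(minMax.getMin() + " .. " + minMax.getMax());
  }
}
```

Computing both fields in the constructor avoids the boolean flag and the second
pass over the set that two `getMaxOrMin` calls would need.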

##
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##
@@ -291,32 +291,29 @@ boolean isNullPage(int pageIndex) {
 @Override
 public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
   Set values = in.getValues();
-  TreeSet myTreeSet = new TreeSet<>();
-  IntSet matchingIndexes1 = new IntOpenHashSet();  // for null
+  IntSet matchingIndexesForNull = new IntOpenHashSet();  // for null
   Iterator it = values.iterator();
   while(it.hasNext()) {
 T value = it.next();
-if (value != null) {
-  myTreeSet.add(value);
-} else {
+if (value == null) {
   if (nullCounts == null) {
 // Searching for nulls so if we don't have null related statistics 
we have to return all pages
 return IndexIterator.all(getPageCount());
   } else {
 for (int i = 0; i < nullCounts.length; i++) {
   if (nullCounts[i] > 0) {
-matchingIndexes1.add(i);
+matchingIndexesForNull.add(i);
   }
 }
   }
 }
   }
 
-  IntSet matchingIndexes2 = new IntOpenHashSet();
-  IntSet matchingIndexes3 = new IntOpenHashSet();
+  IntSet matchingIndexesLessThanMax = new IntOpenHashSet();
+  IntSet matchingIndexesLargerThanMin = new IntOpenHashSet();

Review comment:
   I would suggest using `Greater` instead of `Larger`. That is the usual 
naming: we already have LT (LessThan) and GT (GreaterThan).

##
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##
@@ -287,6 +288,79 @@ boolean isNullPage(int pageIndex) {
   pageIndex -> nullCounts[pageIndex] > 0 || 
matchingIndexes.contains(pageIndex));
 }
 
+@Override
+public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
+  Set values = in.getValues();
+  IntSet matchingIndexesForNull = new IntOpenHashSet();  // for null
+  Iterator it = values.iterator();
+  while(it.hasNext()) {
+T value = it.next();
+if (value == null) {
+  if (nullCounts == null) {
+// Searching for nulls so if we don't have null related statistics 
we have to return all pages
+return IndexIterator.all(getPageCount());
+  } else {
+for (int i = 0; i < nullCounts.length; i++) {
+  if (nullCounts[i] > 0) {
+matchingIndexesForNull.add(i);
+  }
+}
+  

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418870#comment-17418870
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r714379682



##
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##
@@ -293,8 +290,48 @@ boolean isNullPage(int pageIndex) {
 
 @Override
 public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
-  IntSet indexes = getMatchingIndexes(in);
-  return IndexIterator.filter(getPageCount(), indexes::contains);
+  Set values = in.getValues();
+  TreeSet myTreeSet = new TreeSet<>();
+  IntSet matchingIndexes1 = new IntOpenHashSet();  // for null
+  Iterator it = values.iterator();
+  while(it.hasNext()) {
+T value = it.next();
+if (value != null) {
+  myTreeSet.add(value);
+} else {
+  if (nullCounts == null) {
+// Searching for nulls so if we don't have null related statistics 
we have to return all pages
+return IndexIterator.all(getPageCount());
+  } else {
+for (int i = 0; i < nullCounts.length; i++) {
+  if (nullCounts[i] > 0) {
+matchingIndexes1.add(i);
+  }
+}
+  }
+}
+  }
+
+  IntSet matchingIndexes2 = new IntOpenHashSet();

Review comment:
   @gszadovszky Sorry for the delay. I made the following changes:
   1. Removed `TreeSet` to avoid sorting the whole set, and added a method to 
get the max and min.
   2. Updated `StatisticsFilter` to use a similar range comparison.
   3. I didn't change `notIn` because I don't think range checking 
works for `notIn`: if the set values are outside the range, `notIn` is true, 
but if they are inside the range, it doesn't mean `notIn` is false. For 
example, if we have `1, 2, 3, 6, 7, 8, 9, 10` and the `notIn` predicate has 
set values `4, 5`, then `4, 5` is within the range 1 to 10, but 
`1, 2, 3, 6, 7, 8, 9, 10` doesn't contain `4` or `5`. In `StatisticsFilter` I 
simply return `BLOCK_MIGHT_MATCH` for `notIn`. I probably should return 
`IndexIterator.all(getPageCount())` in `ColumnIndexBuilder` to be consistent 
with `StatisticsFilter`.
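
The `notIn` example in point 3 can be checked with a small standalone sketch.
This is illustrative only and uses plain Java collections rather than Parquet
statistics; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class NotInRangeDemo {
  // Range test: does [min(values), max(values)] overlap the page's [min, max]?
  static boolean rangesOverlap(List<Integer> page, Set<Integer> values) {
    return Collections.min(values) <= Collections.max(page)
        && Collections.max(values) >= Collections.min(page);
  }

  // Exact answer for notIn: does any row hold a value outside the set?
  static boolean anyRowNotIn(List<Integer> page, Set<Integer> values) {
    for (Integer v : page) {
      if (!values.contains(v)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<Integer> page = Arrays.asList(1, 2, 3, 6, 7, 8, 9, 10);
    Set<Integer> values = new HashSet<>(Arrays.asList(4, 5));
    // The ranges overlap, yet every row still satisfies notIn, so the
    // overlap alone cannot be used to drop the page for notIn.
    System.out.println("ranges overlap: " + rangesOverlap(page, values));
    System.out.println("rows match notIn: " + anyRowNotIn(page, values));
  }
}
```

Both checks come out true for this data, which is why returning "might match"
is the safe answer for `notIn` when only min/max are known.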

##
File path: 
parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##
@@ -1,14 +1,14 @@
-/* 

Review comment:
   @gszadovszky I added a test in `TestRecordLevelFilters.java` to test the 
new methods in the generated class 
`IncrementallyUpdatedFilterPredicateBuilder`. I have added tests in 
`TestColumnIndexFiltering` and `TestBloomFiltering` in my original changes. Do 
I need more tests in these two files?






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415391#comment-17415391
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r708974511



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -250,27 +250,16 @@ public Eq(Column column, T value) {
 }
   }
 
-  // base class for In and NotIn
+  // base class for In and NotIn. In is used to filter data based on a list of 
values. NotIn is used to filter data that
+  // are not in the list of values.

Review comment:
   It is a nit, but javadoc-style comments in Java start with `/**`. If the 
simple `/*` form is used, the comment won't be included in the generated 
javadocs.






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415324#comment-17415324
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r708850491



##
File path: 
parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##
@@ -1,14 +1,14 @@
-/* 

Review comment:
   I will think about how to achieve this. Thanks for the suggestion.






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415323#comment-17415323
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r708850129



##
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##
@@ -293,8 +290,48 @@ boolean isNullPage(int pageIndex) {
 
 @Override
 public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
-  IntSet indexes = getMatchingIndexes(in);
-  return IndexIterator.filter(getPageCount(), indexes::contains);
+  Set values = in.getValues();
+  TreeSet myTreeSet = new TreeSet<>();
+  IntSet matchingIndexes1 = new IntOpenHashSet();  // for null
+  Iterator it = values.iterator();
+  while(it.hasNext()) {
+T value = it.next();
+if (value != null) {
+  myTreeSet.add(value);
+} else {
+  if (nullCounts == null) {
+// Searching for nulls so if we don't have null related statistics 
we have to return all pages
+return IndexIterator.all(getPageCount());
+  } else {
+for (int i = 0; i < nullCounts.length; i++) {
+  if (nullCounts[i] > 0) {
+matchingIndexes1.add(i);
+  }
+}
+  }
+}
+  }
+
+  IntSet matchingIndexes2 = new IntOpenHashSet();

Review comment:
   I will think more about a better way to implement this 
`visit(In in)`. 

##
File path: 
parquet-column/src/test/java/org/apache/parquet/internal/filter2/columnindex/TestColumnIndexFilter.java
##
@@ -410,6 +410,7 @@ public void testFiltering() {
 Set set5 = new HashSet<>();
 set5.add(7);
 set5.add(20);
+System.out.println(in(intColumn("column5"), set5).toString());

Review comment:
   Sorry, removed.

##
File path: 
parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##
@@ -67,15 +67,18 @@ public void run() throws IOException {
 add("package org.apache.parquet.filter2.recordlevel;\n" +
 "\n" +
 "import java.util.List;\n" +
+"import java.util.Set;\n" +
 "\n" +
 "import org.apache.parquet.hadoop.metadata.ColumnPath;\n" +
 "import org.apache.parquet.filter2.predicate.Operators.Eq;\n" +
 "import org.apache.parquet.filter2.predicate.Operators.Gt;\n" +
 "import org.apache.parquet.filter2.predicate.Operators.GtEq;\n" +
+  "import org.apache.parquet.filter2.predicate.Operators.In;\n" +

Review comment:
   fixed






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415321#comment-17415321
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r708849684



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -250,27 +250,16 @@ public Eq(Column column, T value) {
 }
   }
 
-  // base class for In and NotIn
+  // base class for In and NotIn. In is used to filter data based on a list of 
values. NotIn is used to filter data that
+  // are not in the list of values.

Review comment:
   Changed.

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/filter2/bloomfilterlevel/BloomFilterImpl.java
##
@@ -118,6 +120,45 @@ private ColumnChunkMetaData getColumnChunk(ColumnPath 
columnPath) {
 return BLOCK_MIGHT_MATCH;
   }
 
+  @Override
+  public <T extends Comparable<T>> Boolean visit(Operators.In<T> in) {
+Set values = in.getValues();
+
+if (values.contains(null)) {
+  // the bloom filter bitset contains only non-null values so isn't 
helpful. this
+  // could check the column stats, but the StatisticsFilter is responsible
+  return BLOCK_MIGHT_MATCH;
+}
+
+Operators.Column filterColumn = in.getColumn();
+ColumnChunkMetaData meta = getColumnChunk(filterColumn.getColumnPath());
+if (meta == null) {
+  // the column isn't in this file so all values are null, but the value
+  // must be non-null because of the above check.
+  return BLOCK_CANNOT_MATCH;
+}
+
+try {
+  BloomFilter bloomFilter = bloomFilterReader.readBloomFilter(meta);
+  if (bloomFilter != null) {
+for (T value : values) {
+  if (bloomFilter.findHash(bloomFilter.hash(value))) {
+return BLOCK_MIGHT_MATCH;
+  }
+}
+return BLOCK_CANNOT_MATCH;
+  }
+} catch (RuntimeException e) {
+  LOG.warn(e.getMessage());
+}

Review comment:
   Removed.






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414964#comment-17414964
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r708260743



##
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##
@@ -293,8 +290,48 @@ boolean isNullPage(int pageIndex) {
 
 @Override
 public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
-  IntSet indexes = getMatchingIndexes(in);
-  return IndexIterator.filter(getPageCount(), indexes::contains);
+  Set values = in.getValues();
+  TreeSet myTreeSet = new TreeSet<>();
+  IntSet matchingIndexes1 = new IntOpenHashSet();  // for null
+  Iterator it = values.iterator();
+  while(it.hasNext()) {
+T value = it.next();
+if (value != null) {
+  myTreeSet.add(value);
+} else {
+  if (nullCounts == null) {
+// Searching for nulls so if we don't have null related statistics 
we have to return all pages
+return IndexIterator.all(getPageCount());
+  } else {
+for (int i = 0; i < nullCounts.length; i++) {
+  if (nullCounts[i] > 0) {
+matchingIndexes1.add(i);
+  }
+}
+  }
+}
+  }
+
+  IntSet matchingIndexes2 = new IntOpenHashSet();

Review comment:
   Please choose better naming for the final implementation.

##
File path: 
parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##
@@ -1,14 +1,14 @@
-/* 

Review comment:
   We have to validate In and NotIn for the record level filtering as well. 
All the tests here validate the higher levels only. (Record level filtering 
is when you actually read the data and check the values one by one to return 
only those values that fulfill the filter. This is done by the class 
`IncrementallyUpdatedFilterPredicateBuilder` generated by this one.) 
   
   I would suggest checking and extending the test classes 
`TestRecordLevelFilters`, `TestColumnIndexFiltering` and `TestBloomFiltering`. 
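
To make the term concrete, record-level filtering for In and NotIn amounts to
a per-value membership test. The sketch below uses plain Java collections and
hypothetical names; it is not the code generated into
`IncrementallyUpdatedFilterPredicateBuilder`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RecordLevelInDemo {
  // Keeps only the values that are members of the set (In semantics).
  static <T> List<T> filterIn(List<T> column, Set<T> values) {
    List<T> kept = new ArrayList<>();
    for (T v : column) {
      if (values.contains(v)) {
        kept.add(v);
      }
    }
    return kept;
  }

  // Keeps only the values that are not members of the set (NotIn semantics).
  static <T> List<T> filterNotIn(List<T> column, Set<T> values) {
    List<T> kept = new ArrayList<>();
    for (T v : column) {
      if (!values.contains(v)) {
        kept.add(v);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("thing1", "p100", "other", "thing2");
    Set<String> wanted = new HashSet<>(Arrays.asList("thing1", "thing2"));
    System.out.println(filterIn(names, wanted));
    System.out.println(filterNotIn(names, wanted));
  }
}
```

Unlike the statistics, column-index and bloom-filter levels, this level gives an
exact answer per record, which is why it needs its own test coverage.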

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/filter2/bloomfilterlevel/BloomFilterImpl.java
##
@@ -118,6 +120,45 @@ private ColumnChunkMetaData getColumnChunk(ColumnPath 
columnPath) {
 return BLOCK_MIGHT_MATCH;
   }
 
+  @Override
+  public <T extends Comparable<T>> Boolean visit(Operators.In<T> in) {
+Set values = in.getValues();
+
+if (values.contains(null)) {
+  // the bloom filter bitset contains only non-null values so isn't 
helpful. this
+  // could check the column stats, but the StatisticsFilter is responsible
+  return BLOCK_MIGHT_MATCH;
+}
+
+Operators.Column filterColumn = in.getColumn();
+ColumnChunkMetaData meta = getColumnChunk(filterColumn.getColumnPath());
+if (meta == null) {
+  // the column isn't in this file so all values are null, but the value
+  // must be non-null because of the above check.
+  return BLOCK_CANNOT_MATCH;
+}
+
+try {
+  BloomFilter bloomFilter = bloomFilterReader.readBloomFilter(meta);
+  if (bloomFilter != null) {
+for (T value : values) {
+  if (bloomFilter.findHash(bloomFilter.hash(value))) {
+return BLOCK_MIGHT_MATCH;
+  }
+}
+return BLOCK_CANNOT_MATCH;
+  }
+} catch (RuntimeException e) {
+  LOG.warn(e.getMessage());
+}

Review comment:
   Why is this necessary? Shouldn't we simply allow a RuntimeException 
thrown by the bloom filters to propagate?

##
File path: 
parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##
@@ -67,15 +67,18 @@ public void run() throws IOException {
 add("package org.apache.parquet.filter2.recordlevel;\n" +
 "\n" +
 "import java.util.List;\n" +
+"import java.util.Set;\n" +
 "\n" +
 "import org.apache.parquet.hadoop.metadata.ColumnPath;\n" +
 "import org.apache.parquet.filter2.predicate.Operators.Eq;\n" +
 "import org.apache.parquet.filter2.predicate.Operators.Gt;\n" +
 "import org.apache.parquet.filter2.predicate.Operators.GtEq;\n" +
+  "import org.apache.parquet.filter2.predicate.Operators.In;\n" +

Review comment:
   Please correct the indentation.

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -250,27 +250,16 @@ public Eq(Column column, T value) {
 }
   }
 
-  // base class for In and NotIn
+  // base class for In and NotIn. In is used to filter data based on a list of 
values. NotIn is used to filter data that
+  // are not in the list of values.

Review 

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408020#comment-17408020
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699767560



##
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##
@@ -287,6 +291,27 @@ boolean isNullPage(int pageIndex) {
   pageIndex -> nullCounts[pageIndex] > 0 || 
matchingIndexes.contains(pageIndex));
 }
 
+@Override

Review comment:
   @gszadovszky I tried this: if the values in a page are <= the max value 
in the IN set, and >= the min value in the IN set, then the page might contain 
the values in the IN set. I am not sure if this is what you want, so I only 
changed `In` for now. Please take a look. Thanks!
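
One way to read the range check described above is as an interval-overlap
test between each page's [min, max] and the [min, max] of the IN set. The
following is a sketch under that reading; the `int` arrays stand in for the
column index's per-page statistics, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class InPageRangeDemo {
  // Returns indexes of pages whose [min, max] overlaps [min(values), max(values)].
  static List<Integer> candidatePages(int[] pageMins, int[] pageMaxes, Set<Integer> values) {
    int setMin = Collections.min(values);
    int setMax = Collections.max(values);
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < pageMins.length; i++) {
      if (pageMins[i] <= setMax && pageMaxes[i] >= setMin) {
        result.add(i);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    int[] mins = {1, 6, 11, 21};
    int[] maxes = {5, 10, 15, 30};
    Set<Integer> values = new TreeSet<>(Arrays.asList(7, 20));
    // Page 2 (11..15) is kept although it holds neither 7 nor 20:
    // the check is conservative and may return false positives.
    System.out.println(candidatePages(mins, maxes, values));
  }
}
```

The check never drops a page that could contain a set value, but it can keep
pages that fall in a gap between set values.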

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn

Review comment:
   Fixed. Thanks!

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+private final Column column;
+private final Set values;
+private final String toString;
+
+protected SetColumnFilterPredicate(Column column, Set values) {
+  this.column = Objects.requireNonNull(column, "column cannot be null");
+  this.values = Objects.requireNonNull(values, "values cannot be null");
+  checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate 
shouldn't be empty!");
+
+  String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);

Review comment:
   Removed.

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+private final Column column;
+private final Set values;
+private final String toString;
+
+protected SetColumnFilterPredicate(Column column, Set values) {
+  this.column = Objects.requireNonNull(column, "column cannot be null");
+  this.values = Objects.requireNonNull(values, "values cannot be null");
+  checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate 
shouldn't be empty!");
+
+  String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+  StringBuilder str = new StringBuilder();
+  int iter = 0;
+  for (T value : values) {
+if (iter >= 100) break;
+str.append(value).append(", ");
+iter++;
+  }
+  String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 
2) : str + "...";
+  this.toString = name + "(" + column.getColumnPath().toDotString() + ", " 
+ valueStr + ")";

Review comment:
   Fixed. Thanks!

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+private final Column column;
+private final Set values;
+private final String toString;
+
+protected SetColumnFilterPredicate(Column column, Set values) {
+  this.column = Objects.requireNonNull(column, "column cannot be null");
+  this.values = Objects.requireNonNull(values, "values cannot be null");
+  checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate 
shouldn't be empty!");
+
+  String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+  StringBuilder str = new StringBuilder();
+  int iter = 0;
+  for (T value : values) {
+if (iter >= 100) break;
+str.append(value).append(", ");
+iter++;
+  }
+  String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 
2) : str + "...";
+  this.toString = name + "(" + column.getColumnPath().toDotString() + ", " 
+ valueStr + ")";
+}
+
+public Column getColumn() {
+  return column;
+}
+
+public Set getValues() {
+  return values;
+}
+
+@Override
+public String toString() {
+  return toString;
+}
+
+@Override
+public boolean equals(Object o) {
+  if (this == o) return true;
+  if (o == null || getClass() != o.getClass()) return false;

Review comment:
   Yes, but just trying to follow the style at 

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-09-01 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407996#comment-17407996
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

viirya commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699604019



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+private final Column column;
+private final Set values;
+private final String toString;
+
+protected SetColumnFilterPredicate(Column column, Set values) {
+  this.column = Objects.requireNonNull(column, "column cannot be null");
+  this.values = Objects.requireNonNull(values, "values cannot be null");
+  checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate 
shouldn't be empty!");
+
+  String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+  StringBuilder str = new StringBuilder();
+  int iter = 0;
+  for (T value : values) {
+if (iter >= 100) break;
+str.append(value).append(", ");
+iter++;
+  }
+  String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 
2) : str + "...";
+  this.toString = name + "(" + column.getColumnPath().toDotString() + ", " 
+ valueStr + ")";

Review comment:
   Would it be enough to replace `str + "..."` with 
`str.append("...").toString`?

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+private final Column column;
+private final Set values;
+private final String toString;
+
+protected SetColumnFilterPredicate(Column column, Set values) {
+  this.column = Objects.requireNonNull(column, "column cannot be null");
+  this.values = Objects.requireNonNull(values, "values cannot be null");
+  checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate 
shouldn't be empty!");
+
+  String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+  StringBuilder str = new StringBuilder();
+  int iter = 0;
+  for (T value : values) {
+if (iter >= 100) break;
+str.append(value).append(", ");
+iter++;
+  }
+  String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 
2) : str + "...";
+  this.toString = name + "(" + column.getColumnPath().toDotString() + ", " 
+ valueStr + ")";

Review comment:
   `str.substring(0, str.length() - 2)` is still a `StringBuilder` operation, 
so it seems fine?

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
   Maybe we can replace line 273 with a `StringBuilder` operation too?





[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407733#comment-17407733
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699767598



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn

Review comment:
   Fixed. Thanks!

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);

Review comment:
   Removed.

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
   Fixed. Thanks!

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";
+    }
+
+    public Column<T> getColumn() {
+      return column;
+    }
+
+    public Set<T> getValues() {
+      return values;
+    }
+
+    @Override
+    public String toString() {
+      return toString;
+    }
+
+    @Override
+    public boolean equals(Object o) {
+      if (this == o) return true;
+      if (o == null || getClass() != o.getClass()) return false;

Review comment:
   Yes, but just trying to follow the style at 
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java#L150

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407732#comment-17407732
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699767560



##
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##
@@ -287,6 +291,27 @@ boolean isNullPage(int pageIndex) {
   pageIndex -> nullCounts[pageIndex] > 0 || matchingIndexes.contains(pageIndex));
 }
 
+@Override

Review comment:
   @gszadovszky I tried this: if the values in a page are <= the max value 
in the IN set and >= the min value in the IN set, then the page might contain 
values from the IN set. I am not sure if this is what you want, so I only 
changed `In` for now. Please take a look. Thanks!
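
One way to read the min/max idea above is as a range-overlap test: a page can only hold a value from the IN set if the page's [min, max] range overlaps the range spanned by the set. The sketch below is illustrative only. `InPagePruning`, `candidatePages`, and the list-based page statistics are hypothetical names, not the `ColumnIndexBuilder` API, and null pages and null counts are ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class InPagePruning {
  // Returns the indexes of pages whose [min, max] range overlaps the range
  // spanned by the IN set. Pages outside that range cannot contain any of
  // the values; pages inside it only *might* contain one.
  static <T extends Comparable<T>> List<Integer> candidatePages(
      List<T> pageMins, List<T> pageMaxs, Set<T> inValues) {
    T setMin = Collections.min(inValues);
    T setMax = Collections.max(inValues);
    List<Integer> pages = new ArrayList<>();
    for (int i = 0; i < pageMins.size(); i++) {
      boolean overlaps = pageMins.get(i).compareTo(setMax) <= 0
          && pageMaxs.get(i).compareTo(setMin) >= 0;
      if (overlaps) pages.add(i);
    }
    return pages;
  }
}
```

With page ranges [0, 5], [10, 15], [20, 25] and the IN set {12, 14}, only the second page is kept. The check is conservative: a kept page still has to be read and filtered record by record.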






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407610#comment-17407610
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

viirya commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699605818



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
   Maybe we can replace line 273 with a `StringBuilder` operation too?






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407608#comment-17407608
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

viirya commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699604499



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
   `str.substring(0, str.length() - 2)` is still a `StringBuilder` operation, 
so it seems fine?






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407606#comment-17407606
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

viirya commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r699604019



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
   Is it enough to just replace `str + "..."` with 
`str.append("...").toString()`?






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407440#comment-17407440
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#issuecomment-908849008


   @gszadovszky @shangxinli @dbtsai Thank you all very much for reviewing! I 
have changed the code to generate the visit methods for in/notIn and also added 
default implementations that throw an exception. I will address the rest of the 
comments tomorrow or the day after tomorrow. 
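
The "default that throws" approach mentioned above can be sketched with Java default methods: new operations are added to the visitor interface with a throwing default, so visitors written before `in`/`notIn` existed still compile and only fail if they are actually handed one of the new predicates. `PredicateVisitor` and `EqOnlyPrinter` are simplified, hypothetical stand-ins, not the real `FilterPredicate.Visitor`.

```java
import java.util.Set;

interface PredicateVisitor<R> {
  R visitEq(String column, Object value);

  // New operations get a throwing default so that existing implementations
  // keep compiling and fail only when the new predicate is actually visited.
  default R visitIn(String column, Set<?> values) {
    throw new UnsupportedOperationException("visit in is not supported.");
  }

  default R visitNotIn(String column, Set<?> values) {
    throw new UnsupportedOperationException("visit notIn is not supported.");
  }
}

// A visitor written before in/notIn existed: it only implements eq.
final class EqOnlyPrinter implements PredicateVisitor<String> {
  @Override
  public String visitEq(String column, Object value) {
    return "eq(" + column + ", " + value + ")";
  }
}
```

The trade-off is that an unsupported predicate surfaces as a runtime exception rather than a compile error, which is usually acceptable when the alternative is breaking every third-party visitor.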




[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407415#comment-17407415
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r698937308



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/recordlevel/IncrementallyUpdatedFilterPredicate.java
##
@@ -123,6 +124,46 @@ public boolean accept(Visitor visitor) {
 }
   }
 
+  abstract class SetInspector implements IncrementallyUpdatedFilterPredicate {

Review comment:
   Changed. Thanks!

##
File path: pom.xml
##
@@ -478,6 +478,7 @@
 change to fix a integer overflow issue.
 TODO: remove this after Parquet 1.13 release -->
   
   <exclude>org.apache.parquet.column.values.dictionary.DictionaryValuesWriter#dictionaryByteSize</exclude>
+  <exclude>org.apache.parquet.filter2.predicate.FilterPredicate</exclude>

Review comment:
   Changed. Thanks!






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-31 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407416#comment-17407416
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

shangxinli commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r698598441



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn

Review comment:
   Please write a better comment here, since this is a public class.

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);

Review comment:
   I see you cache the result in `toString`, but is it generally reused 
multiple times? If not, building the string eagerly in the constructor is wasted work. 
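
For illustration, the alternative this comment hints at is to build the string lazily on the first `toString()` call and cache it afterwards. This is a minimal sketch under assumptions: `LazySetDescription` is a hypothetical class, it formats the values with the set's own `toString()`, and it is not what the PR ended up doing.

```java
import java.util.Set;

final class LazySetDescription {
  private final String columnPath;
  private final Set<?> values;
  // Built on first use; transient so a serialized instance would not carry it.
  private transient String cached;

  LazySetDescription(String columnPath, Set<?> values) {
    this.columnPath = columnPath;
    this.values = values;
  }

  @Override
  public String toString() {
    String s = cached;
    if (s == null) {
      // Benign race: two threads may each build an equal string once.
      s = "in(" + columnPath + ", " + values + ")";
      cached = s;
    }
    return s;
  }
}
```

Predicates that are never logged or printed then pay nothing for the description, while repeated calls still return the cached string.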
 

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
   Would it be possible to merge lines 272 and 273 into the code above that 
builds the string? String operations like this can consume a lot of memory. 
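
A rough sketch of what folding those two lines into the builder could look like, so the whole description is assembled in one `StringBuilder` with no intermediate `substring` or `+` concatenation. `SetPredicateStrings`, `describe`, and `MAX_SHOWN` are hypothetical names, and this is not necessarily the code that was merged.

```java
import java.util.Set;

final class SetPredicateStrings {
  // Hypothetical cap mirroring the literal 100 used in the hunk above.
  static final int MAX_SHOWN = 100;

  // Builds "name(columnPath, v1, v2, ...)" in a single StringBuilder.
  static String describe(String name, String columnPath, Set<?> values) {
    StringBuilder str = new StringBuilder();
    str.append(name).append('(').append(columnPath).append(", ");
    int shown = 0;
    for (Object value : values) {
      if (shown >= MAX_SHOWN) break;
      str.append(value).append(", ");
      shown++;
    }
    if (values.size() <= MAX_SHOWN) {
      // Drop the trailing ", " in place instead of copying via substring.
      str.setLength(str.length() - 2);
    } else {
      // Keep the trailing ", " and mark the truncation, as the original does.
      str.append("...");
    }
    return str.append(')').toString();
  }
}
```

For a three-element set this yields a string such as `in(a.b, 1, 2, 3)`. For more than 100 values the output ends in `, ...)`, matching the format produced by the code under review.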

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";
+    }
+
+    public Column<T> getColumn() {
+      return column;
+    }
+
+    public Set<T> getValues() {
+      return values;
+    }
+
+    @Override
+    public String toString() {
+      return toString;
+    }
+
+    @Override
+    public boolean equals(Object o) {
+      if (this == o) return true;
+      if (o == null || getClass() != o.getClass()) return false;

Review comment:
   I guess you can just 'return this.getClass() == o.getClass()'
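
The quoted hunk stops at the class check, so the rest of the method is not visible in this thread. As a hedged illustration of the usual pattern for a value-like predicate, the sketch below compares the class, the column, and the value set, and keeps `hashCode` consistent with that. `SetPredicateKey` and its fields are hypothetical and simplified to a column path string.

```java
import java.util.Objects;
import java.util.Set;

final class SetPredicateKey {
  private final String columnPath;
  private final Set<?> values;

  SetPredicateKey(String columnPath, Set<?> values) {
    this.columnPath = Objects.requireNonNull(columnPath, "columnPath cannot be null");
    this.values = Objects.requireNonNull(values, "values cannot be null");
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    // The null check has to come first: o.getClass() would throw on null.
    if (o == null || getClass() != o.getClass()) return false;
    SetPredicateKey that = (SetPredicateKey) o;
    return columnPath.equals(that.columnPath) && values.equals(that.values);
  }

  @Override
  public int hashCode() {
    return Objects.hash(columnPath, values);
  }
}
```

Comparing `getClass()` rather than using `instanceof` keeps two different subclasses (for example an "in" and a "not in" predicate) from comparing equal even when they share a column and value set.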

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> 

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407030#comment-17407030
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#issuecomment-908849008


   @gszadovszky @shangxinli @dbtsai Thank you all very much for reviewing! I 
have changed the code to generate the visit methods for in/notIn and also added 
default implementations that throw an exception. I will address the rest of the 
comments tomorrow or the day after tomorrow. 




[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407029#comment-17407029
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r698937308



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/recordlevel/IncrementallyUpdatedFilterPredicate.java
##
@@ -123,6 +124,46 @@ public boolean accept(Visitor visitor) {
 }
   }
 
+  abstract class SetInspector implements IncrementallyUpdatedFilterPredicate {

Review comment:
   Changed. Thanks!

##
File path: pom.xml
##
@@ -478,6 +478,7 @@
 change to fix a integer overflow issue.
 TODO: remove this after Parquet 1.13 release -->
   
   <exclude>org.apache.parquet.column.values.dictionary.DictionaryValuesWriter#dictionaryByteSize</exclude>
+  <exclude>org.apache.parquet.filter2.predicate.FilterPredicate</exclude>

Review comment:
   Changed. Thanks!






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17406813#comment-17406813
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

shangxinli commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r698606499



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);

Review comment:
   I see you cache the result in `toString`, but is it generally reused 
multiple times? If not, building the string eagerly in the constructor is wasted work. 
 

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";

Review comment:
   Would it be possible to merge lines 272 and 273 into the code above that 
builds the string? String operations like this can consume a lot of memory. 

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be null");
+      this.values = Objects.requireNonNull(values, "values cannot be null");
+      checkArgument(!values.isEmpty(), "values in SetColumnFilterPredicate shouldn't be empty!");
+
+      String name = getClass().getSimpleName().toLowerCase(Locale.ENGLISH);
+      StringBuilder str = new StringBuilder();
+      int iter = 0;
+      for (T value : values) {
+        if (iter >= 100) break;
+        str.append(value).append(", ");
+        iter++;
+      }
+      String valueStr = values.size() <= 100 ? str.substring(0, str.length() - 2) : str + "...";
+      this.toString = name + "(" + column.getColumnPath().toDotString() + ", " + valueStr + ")";
+    }
+
+    public Column<T> getColumn() {
+      return column;
+    }
+
+    public Set<T> getValues() {
+      return values;
+    }
+
+    @Override
+    public String toString() {
+      return toString;
+    }
+
+    @Override
+    public boolean equals(Object o) {
+      if (this == o) return true;
+      if (o == null || getClass() != o.getClass()) return false;

Review comment:
   I guess you can just 'return this.getClass() == o.getClass()'

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn
+  public static abstract class SetColumnFilterPredicate<T extends Comparable<T>> implements FilterPredicate, Serializable {
+    private final Column<T> column;
+    private final Set<T> values;
+    private final String toString;
+
+    protected SetColumnFilterPredicate(Column<T> column, Set<T> values) {
+      this.column = Objects.requireNonNull(column, "column cannot be 

[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17406788#comment-17406788
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

shangxinli commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r698598441



##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
##
@@ -247,6 +250,80 @@ public int hashCode() {
 }
   }
 
+  // base class for In and NotIn

Review comment:
   Please write a more descriptive comment here, since this is public.






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405496#comment-17405496
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

dbtsai commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r697021579



##
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##
@@ -287,6 +291,27 @@ boolean isNullPage(int pageIndex) {
   pageIndex -> nullCounts[pageIndex] > 0 || matchingIndexes.contains(pageIndex));
 }
 
+@Override

Review comment:
   +1






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17404591#comment-17404591
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r695929108



##
File path: pom.xml
##
@@ -478,6 +478,7 @@
 change to fix a integer overflow issue.
 TODO: remove this after Parquet 1.13 release -->
   
<exclude>org.apache.parquet.column.values.dictionary.DictionaryValuesWriter#dictionaryByteSize</exclude>
+  <exclude>org.apache.parquet.filter2.predicate.FilterPredicate</exclude>

Review comment:
   Although this is not what the filter2 API is designed for, it is 
technically possible for a user to implement this interface, so by adding new 
methods to it we really do break our API.
   What do you think about adding default implementations that throw 
`UnsupportedOperationException`? This way we would not need to add this class 
here.
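
A minimal sketch of the default-method idea raised in the comment above. `SketchVisitor` and `LegacyVisitor` are hypothetical, simplified stand-ins and not the real `FilterPredicate.Visitor`; the method names, parameter types and exception message are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical, simplified visitor; not the real parquet-mr interface.
interface SketchVisitor<R> {
  // Stands in for a method that existed before the change.
  R visitEq(String column, Object value);

  // Newly added method. Because it has a default body, implementations
  // written against the old interface still compile, and they only fail
  // if the new predicate type is actually visited.
  default R visitIn(String column, Set<?> values) {
    throw new UnsupportedOperationException("visit in is not supported.");
  }
}

// An implementation written before visitIn existed; it needs no changes.
final class LegacyVisitor implements SketchVisitor<String> {
  @Override
  public String visitEq(String column, Object value) {
    return column + " == " + value;
  }

  public static void main(String[] args) {
    LegacyVisitor visitor = new LegacyVisitor();
    System.out.println(visitor.visitEq("a", 1));
    try {
      visitor.visitIn("a", Collections.singleton(1));
    } catch (UnsupportedOperationException e) {
      System.out.println("unsupported: " + e.getMessage());
    }
  }
}
```

The trade-off is that an unsupported predicate surfaces as a runtime exception rather than a compile error, in exchange for keeping existing user implementations source- and binary-compatible.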

##
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/recordlevel/IncrementallyUpdatedFilterPredicate.java
##
@@ -123,6 +124,46 @@ public boolean accept(Visitor visitor) {
 }
   }
 
+  abstract class SetInspector implements IncrementallyUpdatedFilterPredicate {

Review comment:
   I know it is quite a mess to add new stuff to 
`IncrementallyUpdatedFilterPredicateBuilder` (via 
`IncrementallyUpdatedFilterPredicateGenerator`), but currently you are 
checking the values one by one even though you have a hash set. It could be 
faster if the related visit methods were generated in 
`IncrementallyUpdatedFilterPredicateBuilder`, just like for the other 
predicates, and a hash lookup were used. What do you think?
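
To illustrate the difference the comment above points at, here is a small sketch contrasting a one-by-one scan with a hash lookup. `SetLookupSketch` and both method names are hypothetical; the real generated visit methods in parquet-mr are not shown here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not parquet-mr code.
final class SetLookupSketch {
  // One-by-one check: up to values.size() comparisons for every record.
  static <T extends Comparable<T>> boolean containsByScan(Set<T> values, T candidate) {
    for (T value : values) {
      if (value.compareTo(candidate) == 0) {
        return true;
      }
    }
    return false;
  }

  // Hash lookup: expected constant time per record when values is a HashSet.
  static <T> boolean containsByHash(Set<T> values, T candidate) {
    return values.contains(candidate);
  }

  public static void main(String[] args) {
    Set<Integer> values = new HashSet<>(Arrays.asList(1, 5, 9));
    System.out.println(containsByScan(values, 5));
    System.out.println(containsByHash(values, 5));
  }
}
```

Note that the two variants only agree when `compareTo` is consistent with `equals` and `hashCode` for the value type, which is an assumption in this sketch.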

##
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##
@@ -287,6 +291,27 @@ boolean isNullPage(int pageIndex) {
   pageIndex -> nullCounts[pageIndex] > 0 || matchingIndexes.contains(pageIndex));
 }
 
+@Override

Review comment:
   I am not sure how this affects performance in real life (e.g. how many 
values are in the set of the in/notIn predicate and how many pages we have), 
but it can be done in a smarter way. Column indexes hold the min/max values 
for the pages. If we sort the values in the set, we can do two logarithmic 
searches (one for min, then one for max) to decide whether a page might 
contain one of the values. If the column indexes themselves have "sorted" 
min/max values, then we can be even faster. (See `BoundaryOrder` for more 
details.)
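
One possible reading of the suggestion above, sketched for an `in` predicate over `long` values: sort the set once, then run two binary searches per page against that page's min and max. `PageRangeSketch` and its methods are hypothetical, and null pages, `notIn`, and the `BoundaryOrder` optimization are deliberately left out.

```java
// Hypothetical illustration only; not parquet-mr code.
final class PageRangeSketch {
  // Index of the first element >= key, or a.length if there is none.
  static int firstIndexAtLeast(long[] a, long key) {
    int lo = 0;
    int hi = a.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (a[mid] < key) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // Index of the first element > key, or a.length if there is none.
  static int firstIndexGreaterThan(long[] a, long key) {
    int lo = 0;
    int hi = a.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (a[mid] <= key) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // True if at least one of the sorted values lies in [pageMin, pageMax],
  // i.e. the page cannot be skipped for an `in` predicate.
  static boolean mightContain(long[] sortedValues, long pageMin, long pageMax) {
    int from = firstIndexAtLeast(sortedValues, pageMin);
    int to = firstIndexGreaterThan(sortedValues, pageMax);
    return from < to;
  }

  public static void main(String[] args) {
    long[] sortedValues = {3L, 10L, 42L};
    System.out.println(mightContain(sortedValues, 4L, 9L));
    System.out.println(mightContain(sortedValues, 5L, 10L));
  }
}
```

With n values and p pages this costs O(n log n) once for sorting plus O(p log n) for the page checks, instead of comparing every value against every page.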






[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400154#comment-17400154
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#issuecomment-900018869


   also cc @chenjunjiedada




[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17399823#comment-17399823
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao commented on pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#issuecomment-899638598


   @gszadovszky @shangxinli @rdblue Could you please take a look at this PR 
when you have time? Thanks a lot!




[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-08-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17399201#comment-17399201
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-

huaxingao opened a new pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923


   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-1968
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   




[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-05-25 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351199#comment-17351199
 ] 

Xinli Shang commented on PARQUET-1968:
--

Go ahead and work on it. Thanks, Huaxin!



[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-05-25 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351193#comment-17351193
 ] 

Huaxin Gao commented on PARQUET-1968:
-

[~shangxinli] [~rdblue] Hi Xinli and Ryan, I am just wondering what has been 
decided about this native IN predicate support in the parquet sync meeting. Has 
somebody started working on this one yet? Thanks!



[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276664#comment-17276664
 ] 

Xinli Shang commented on PARQUET-1968:
--

Sure, will connect with you shortly. 



[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276548#comment-17276548
 ] 

Ryan Blue commented on PARQUET-1968:


Thank you! I'm not sure why it was no longer on my calendar. I have the invite 
now and I plan to attend the sync on the 23rd. If you'd like, we can also set 
up a time to talk about this integration specifically, since it may take a 
while.



[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276533#comment-17276533
 ] 

Xinli Shang commented on PARQUET-1968:
--

Hi [~rdblue]. We didn't discuss it in last week's Parquet sync meeting since 
you were not there. The next Parquet sync is on Feb 23rd at 9:00am. I just 
added you explicitly with your Netflix email account. Hopefully, you can join. 



[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Ryan Blue (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276526#comment-17276526
 ] 

Ryan Blue commented on PARQUET-1968:


I would really like to see a new Parquet API that can support some of the 
additional features we needed for Iceberg. I proposed adopting Iceberg's filter 
expressions a year or two ago, so I'm glad to see that the idea has some 
support from other PMC members. This is one reason why the API is in a separate 
module. I think we were planning to talk about this at the next Parquet sync, 
although I'm not sure when that will be.

FYI [~sha...@uber.com].



[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276421#comment-17276421
 ] 

Gabor Szadovszky commented on PARQUET-1968:
---

This one sounds great. Meanwhile, we were talking with [~rdblue] about the 
filtering APIs of Iceberg and Parquet. Iceberg's API already contains this 
feature, and it seems clearer and more usable than the one implemented in 
Parquet. It might be a good idea to separate out this filtering API in Iceberg 
and use/implement it in Parquet. (See 
https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/expressions/Expression.java 
for Iceberg's API.)
