Github user wangyum commented on a diff in the pull request:
https://github.com/apache/spark/pull/21603#discussion_r197011649
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
case sources.Not(pred) =>
createFilter(schema, pred).map(FilterApi.not)
+ case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 =>
--- End diff --
The threshold is **20**. Too many `values` may cause an OOM, for example:
```scala
spark.range(10000000).coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/SPARK-17091")
val df = spark.read.parquet("/tmp/spark/parquet/SPARK-17091/")
df.where(s"id in(${Range(1, 10000).mkString(",")})").count
```
```
Exception in thread "SIGINT handler" 18/06/21 13:00:54 WARN TaskSetManager: Lost task 7.0 in stage 1.0 (TID 8, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.StringBuilder.toString(StringBuilder.java:407)
    at org.apache.parquet.filter2.predicate.Operators$BinaryLogicalFilterPredicate.<init>(Operators.java:263)
    at org.apache.parquet.filter2.predicate.Operators$Or.<init>(Operators.java:316)
    at org.apache.parquet.filter2.predicate.FilterApi.or(FilterApi.java:261)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$15.apply(ParquetFilters.scala:276)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$15.apply(ParquetFilters.scala:276)
...
```
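The stack trace comes from the `In` filter being rewritten as a chain of nested `Or` predicates, one per value. A minimal sketch of that shape in plain Scala (no Spark/Parquet dependencies; the `Predicate`, `Eq`, `Or`, and `inToOr` names here are illustrative, the real code builds `FilterApi.eq`/`FilterApi.or` nodes):

```scala
// Toy predicate tree standing in for Parquet's Operators hierarchy.
sealed trait Predicate
case class Eq(name: String, value: Any) extends Predicate
case class Or(left: Predicate, right: Predicate) extends Predicate

// In(name, values) becomes Eq nodes or-ed together, left-associated.
def inToOr(name: String, values: Seq[Any]): Predicate =
  values.map(v => Eq(name, v): Predicate).reduceLeft(Or(_, _))

val p = inToOr("id", Seq(1, 2, 3))
// p == Or(Or(Eq("id", 1), Eq("id", 2)), Eq("id", 3))
```

With 10000 values this produces 9999 nested `Or` objects, and each `Operators$Or` constructor builds its description string from both children, so the string work grows quadratically with the value count, which is what blows the heap in `StringBuilder.toString` above. The `values.length < 20` guard keeps the pushed-down chain small.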
---