[ https://issues.apache.org/jira/browse/SPARK-28371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun resolved SPARK-28371.
-----------------------------------
Resolution: Fixed
Fix Version/s: 2.4.4, 3.0.0
Issue resolved by pull request 25140
[https://github.com/apache/spark/pull/25140]
> Parquet "starts with" filter is not null-safe
> ---------------------------------------------
>
> Key: SPARK-28371
> URL: https://issues.apache.org/jira/browse/SPARK-28371
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Marcelo Vanzin
> Assignee: Marcelo Vanzin
> Priority: Major
> Fix For: 3.0.0, 2.4.4
>
>
> I ran into this when running unit tests with Parquet 1.11. Parquet 1.10
> appears to have the same behavior in a few places, but Spark does not
> trigger those code paths.
> {{UserDefinedPredicate.keep}} is required to be null-safe, and Spark's
> implementation for the "starts with" filter is not. This requirement was
> clarified in Parquet's documentation in PARQUET-1489.
> The failure I was getting:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most
> recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor
> driver): java.lang.NullPointerException
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>   at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>   at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
>   at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137)
>   at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
>   at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954)
>   at org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207)
>   at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
>   at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439)
> ...
> {noformat}
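
For readers unfamiliar with the failure mode described above, here is a minimal sketch of what "null-safe" means for a prefix predicate. It is not the code from pull request 25140 and does not use the Parquet library: `KeepPredicate`, `hasPrefix`, `unsafeStartsWith`, and `nullSafeStartsWith` are hypothetical names invented for illustration, with `KeepPredicate.keep` standing in for `UserDefinedPredicate.keep`. It assumes the caller may pass `null` for a missing value, as the trace through `ColumnIndexFilter` suggests.

```java
import java.nio.charset.StandardCharsets;

public class NullSafeStartsWith {
    // Hypothetical stand-in for the keep() method of Parquet's
    // UserDefinedPredicate; not the real Parquet type.
    public interface KeepPredicate<T> {
        boolean keep(T value);
    }

    // True when `value` begins with the bytes of `prefix`.
    // Dereferences `value`, so it throws on null input.
    public static boolean hasPrefix(byte[] value, byte[] prefix) {
        if (value.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (value[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Mirrors the reported bug: no null check before touching the value.
    public static KeepPredicate<byte[]> unsafeStartsWith(String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        return value -> hasPrefix(value, p);
    }

    // Null-safe variant: a null value can never start with the prefix,
    // so it is rejected before any dereference.
    public static KeepPredicate<byte[]> nullSafeStartsWith(String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        return value -> value != null && hasPrefix(value, p);
    }

    public static void main(String[] args) {
        byte[] abc = "abc".getBytes(StandardCharsets.UTF_8);
        KeepPredicate<byte[]> safe = nullSafeStartsWith("ab");
        System.out.println(safe.keep(abc));   // prints "true"
        System.out.println(safe.keep(null));  // prints "false"
        try {
            unsafeStartsWith("ab").keep(null);
        } catch (NullPointerException e) {
            System.out.println("unsafe predicate threw NullPointerException");
        }
    }
}
```

The guard returns `false` for `null` on the assumption that a missing value does not match any prefix, which is consistent with SQL semantics for a "starts with" filter.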