[ https://issues.apache.org/jira/browse/SPARK-28371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcelo Vanzin updated SPARK-28371:
-----------------------------------
    Description:
I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 has the same behavior in a few places but Spark somehow doesn't trigger those code paths.

Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's implementation is not. This was clarified in Parquet's documentation in PARQUET-1489.

Failure I was getting:

{noformat}
Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor driver): java.lang.NullPointerException
 at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544)
 at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
 at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
 at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
 at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
 at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954)
 at org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759)
 at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207)
 at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
 at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
 at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439)
 ...
{noformat}

  was:
I ran into this when running unit tests with Parquet 1.11. It seems that 1.10 has the same behavior in a few places but Spark somehow doesn't trigger those code paths.

Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's implementation is not. This was clarified in Parquet's documentation in PARQUET-1489.

Failure I was getting:

{noformat}
Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor driver): java.lang.NullPointerException
 at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544)
 at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
 at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
 at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
 at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137)
 at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
 at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954)
 at org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759)
 at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207)
 at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
 at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
 at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439)
 at ...
{noformat}

> Parquet "starts with" filter is not null-safe
> ---------------------------------------------
>
>                 Key: SPARK-28371
>                 URL: https://issues.apache.org/jira/browse/SPARK-28371
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Marcelo Vanzin
>            Priority: Major
>
> I ran into this when running unit tests with Parquet 1.11. It seems that 1.10
> has the same behavior in a few places but Spark somehow doesn't trigger those
> code paths.
> Basically, {{UserDefinedPredicate.keep}} should be null-safe, and Spark's
> implementation is not. This was clarified in Parquet's documentation in
> PARQUET-1489.
> Failure I was getting:
> {noformat}
> Job aborted due to stage failure: Task 0 in stage 1304.0 failed 1 times, most
> recent failure: Lost task 0.0 in stage 1304.0 (TID 2528, localhost, executor
> driver): java.lang.NullPointerException
>  at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:544)
>  at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$16$$anon$1.keep(ParquetFilters.scala:523)
>  at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:152)
>  at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>  at org.apache.parquet.filter2.predicate.Operators$UserDefined.accept(Operators.java:377)
>  at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:181)
>  at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:56)
>  at org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:309)
>  at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:86)
>  at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:81)
>  at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:137)
>  at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
>  at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:954)
>  at org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:759)
>  at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:207)
>  at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
>  at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:439)
>  ...
> {noformat}
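
For illustration, here is a minimal sketch of what a null-safe {{keep}} for a "starts with" check could look like. It is not Spark's {{ParquetFilters}} code and it does not extend Parquet's {{UserDefinedPredicate}}; the class name {{NullSafeStartsWith}} and the use of a plain {{byte[]}} for the column value are assumptions made so the example is self-contained. The only point is the guard: a null value is rejected before anything is dereferenced.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: this is not Spark's ParquetFilters code and
// it does not extend Parquet's UserDefinedPredicate. Only the shape of a
// null-safe keep() is modeled here.
final class NullSafeStartsWith {
    private final byte[] prefix;

    NullSafeStartsWith(String prefix) {
        this.prefix = prefix.getBytes(StandardCharsets.UTF_8);
    }

    // Returns true only when value is non-null and begins with the prefix bytes.
    boolean keep(byte[] value) {
        // Guard first: a null value is rejected instead of being dereferenced,
        // which is where an unguarded implementation would throw an NPE.
        if (value == null || value.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (value[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With this guard, a caller that passes null to {{keep}}, as the {{ColumnIndexFilter}} frames in the trace above appear to do, gets {{false}} back rather than a {{NullPointerException}}.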