Martin Andersson created SPARK-49812:
----------------------------------------

             Summary: NPE when reading empty zstd compressed csv file
                 Key: SPARK-49812
                 URL: https://issues.apache.org/jira/browse/SPARK-49812
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.3
         Environment: Ubuntu 22.04
                      Java 17
                      Spark 3.5.3
            Reporter: Martin Andersson


Reading an empty zstd-compressed CSV file results in an NPE. The same file works 
fine when not compressed.
{code:sh}
$ touch empty.csv
$ zstd < empty.csv > empty.csv.zst
{code}
Reading the uncompressed file works as expected and returns an empty DataFrame.
{code:java}
// Assumes `spark` is an existing SparkSession.
spark.read()
    .option("header", "false")
    .option("lineSep", "|")
    .option("multiLine", "true")
    .option("quote", "")
    .schema("some_col string, other_col string")
    .csv("empty.csv")
    .show();
{code}
Changing the path to "empty.csv.zst" triggers an exception. The exception is 
only triggered for zstd files when both the "multiLine" and "quote" options are 
set.
{code:java}
INFO DAGScheduler: ResultStage 0 (show at Main.java:24) failed in 0.408 s due 
to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.32.18 executor 
driver): org.apache.spark.SparkException: Encountered error while reading file 
file:///tmp/empty.csv.zst. Details:
        at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:864)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
        at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
        at org.apache.spark.scheduler.Task.run(Task.scala:141)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.spark.unsafe.types.UTF8String.toString()" because "currentInput" is null
        at org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:333)
        at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseStream$1(UnivocityParser.scala:400)
        at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60)
        at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseStream$3(UnivocityParser.scala:409)
        at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:432)
        at scala.collection.Iterator$$anon$10.nextCur(Iterator.scala:587)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
        at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
        ... 22 more
{code}
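
The "Caused by" line says that toString() was invoked on a null "currentInput". As a rough illustration of that failure mode only, here is a minimal plain-Java sketch. It is not Spark's code and not a proposed patch: the class NullInputSketch and the methods currentInput, unguarded and guarded are hypothetical names, and the assumption that empty content yields a null input is made up for the example.

```java
import java.util.Optional;

public class NullInputSketch {
    // Hypothetical stand-in for a parser with no current record:
    // returns null for empty content, mimicking a null "currentInput".
    static CharSequence currentInput(String content) {
        return content.isEmpty() ? null : content;
    }

    // Unguarded: dereferences the input directly, so a null input
    // throws NullPointerException.
    static String unguarded(String content) {
        return currentInput(content).toString();
    }

    // Guarded: treats a missing input as "no record" instead of dereferencing it.
    static Optional<String> guarded(String content) {
        CharSequence input = currentInput(content);
        return input == null ? Optional.empty() : Optional.of(input.toString());
    }

    public static void main(String[] args) {
        try {
            unguarded("");
        } catch (NullPointerException e) {
            System.out.println("unguarded: NPE");
        }
        System.out.println("guarded: " + guarded("").isPresent());
    }
}
```

Whether the right fix in Spark is a guard of this kind or making sure the parser never produces a null input for an empty stream is not something this report determines.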


