Josh Rosen created SPARK-27542:
----------------------------------
Summary: SparkHadoopWriter doesn't call setWorkOutputPath,
causing NPEs for some legacy OutputFormats
Key: SPARK-27542
URL: https://issues.apache.org/jira/browse/SPARK-27542
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 2.4.0
Reporter: Josh Rosen
In Hadoop MapReduce, tasks call {{FileOutputFormat.setWorkOutputPath()}} after
configuring the output committer:
[https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L611]
Spark doesn't do this:
[https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L115]
As a result, certain legacy output formats can fail to work out-of-the-box on
Spark. In particular,
{{org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat}} can fail
with NullPointerExceptions, e.g.
{code:java}
java.lang.NullPointerException
at org.apache.hadoop.fs.Path.<init>(Path.java:105)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.getDefaultWorkFile(DeprecatedParquetOutputFormat.java:69)
[...]
at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:96)
{code}
It looks like someone on GitHub has hit the same problem:
https://gist.github.com/themodernlife/e3b07c23ba978f6cc98b73e3f3609abe
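Until this is fixed in Spark itself, one possible user-side workaround is to wrap the legacy format so the work output path is populated before the record writer is created. The sketch below is untested and the wrapper class name is made up; note that pointing the work path at the final output directory bypasses the committer's temporary directory, so it is only appropriate when that is acceptable for the job's committer:
{code:scala}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, RecordWriter}
import org.apache.hadoop.util.Progressable
import org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat

// Hypothetical wrapper (untested): make sure mapred.work.output.dir is set
// before DeprecatedParquetOutputFormat.getDefaultWorkFile() dereferences it.
class WorkPathParquetOutputFormat[V] extends DeprecatedParquetOutputFormat[V] {
  override def getRecordWriter(fs: FileSystem, conf: JobConf, name: String,
                               progress: Progressable): RecordWriter[Void, V] = {
    if (FileOutputFormat.getWorkOutputPath(conf) == null) {
      // Falls back to the final output path, skipping the committer's
      // temporary directory.
      FileOutputFormat.setWorkOutputPath(conf, FileOutputFormat.getOutputPath(conf))
    }
    super.getRecordWriter(fs, conf, name, progress)
  }
}
{code}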
Tez had a very similar bug: https://issues.apache.org/jira/browse/TEZ-3348
We might be able to fix this by having Spark mimic Hadoop's logic. I'm unsure
whether that change would pose compatibility risks for other existing
workloads, though.
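For reference, a sketch of what mimicking Hadoop's logic could look like. This is untested and not an actual patch; the helper name and its parameters are stand-ins for whatever values SparkHadoopWriter has in scope after it sets up the committer for a task attempt:
{code:scala}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileOutputCommitter, FileOutputFormat, JobConf,
  OutputCommitter, TaskAttemptContext}

// Untested sketch mirroring Hadoop's Task.initialize(): once the committer is
// set up for this task attempt, publish its working directory under
// mapred.work.output.dir so legacy formats like DeprecatedParquetOutputFormat
// can resolve getDefaultWorkFile().
def setWorkOutputPathLikeHadoop(conf: JobConf,
                                committer: OutputCommitter,
                                taskContext: TaskAttemptContext,
                                outputPath: Path): Unit = {
  committer match {
    case fc: FileOutputCommitter =>
      // Assumes the mapred FileOutputCommitter's per-attempt path accessor.
      FileOutputFormat.setWorkOutputPath(conf, fc.getTaskAttemptPath(taskContext))
    case _ =>
      // Hadoop falls back to the job's final output path for other committers.
      FileOutputFormat.setWorkOutputPath(conf, outputPath)
  }
}
{code}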
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)