[
https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744026#comment-14744026
]
Yin Huai commented on SPARK-9899:
---------------------------------
https://github.com/apache/spark/pull/8687 adds a warning message to the places
where we save data through the RDD API and where we save data to Hive, advising
against the use of a direct output committer when speculation is enabled. This
change will be included in 1.6.
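As a rough illustration (this is not the actual code from the PR above), the check behind such a warning boils down to detecting the unsafe combination of settings; the config key names below are illustrative assumptions, not necessarily the exact keys the PR inspects:

```java
import java.util.Map;

public class DirectCommitterCheck {
    // Hypothetical sketch: returns true when speculation is enabled while a
    // direct output committer appears to be configured (the unsafe combination).
    public static boolean shouldWarn(Map<String, String> conf) {
        boolean speculation =
            Boolean.parseBoolean(conf.getOrDefault("spark.speculation", "false"));
        String committer =
            conf.getOrDefault("spark.sql.sources.outputCommitterClass", "");
        return speculation && committer.toLowerCase().contains("direct");
    }
}
```

Note that this only warns; the job still runs, so users remain responsible for picking a safe committer when speculation is on.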
> JSON/Parquet writing on retry or speculation broken with direct output committer
> --------------------------------------------------------------------------------
>
> Key: SPARK-9899
> URL: https://issues.apache.org/jira/browse/SPARK-9899
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
> Assignee: Cheng Lian
> Priority: Blocker
> Fix For: 1.5.0
>
>
> If the first task fails, all subsequent tasks will fail as well. We probably
> need to set a different boolean when calling create.
> {code}
> java.io.IOException: File already exists: ...
> ...
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
> at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128)
> at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.<init>(JSONRelation.scala:185)
> at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160)
> at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> {code}
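The failure mode in the stack trace can be reproduced in miniature with plain `java.nio.file`: a non-overwriting create (analogous to the `FileSystem.create` call above) throws once an earlier failed or speculative attempt has already written the same output path. This is only an illustration of the pattern, not Spark or Hadoop code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RetryCollision {
    // Returns true if the second (retried) create of the same path fails,
    // mirroring the "File already exists" IOException in the stack trace.
    public static boolean retryFails() {
        try {
            Path dir = Files.createTempDirectory("spark-demo");
            Path part = dir.resolve("part-00000");
            Files.createFile(part);  // first (failed/speculative) attempt wrote the file
            Files.createFile(part);  // the retried attempt targets the same path...
            return false;
        } catch (FileAlreadyExistsException e) {
            return true;             // ...and collides with the leftover output
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```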
> The root cause is that speculation should not be used together with a direct
> output committer, as there are multiple corner cases in which this combination
> may cause data corruption and/or data loss. Please refer to this
> [GitHub
> comment|https://github.com/apache/spark/pull/8191#issuecomment-131598385] for
> more details about these corner cases.
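For contrast, the reason a conventional two-phase committer tolerates retries and speculation can be sketched with plain file operations: each task attempt writes to its own temporary path, so concurrent attempts never collide, and only the attempt that wins the commit is renamed into the final output location. The directory layout below is a simplified assumption, not the exact Hadoop `FileOutputCommitter` layout:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TwoPhaseCommitSketch {
    // Returns the contents of the final output file after committing the
    // winning attempt; the losing attempt's output is simply never promoted.
    public static String run() {
        try {
            Path out = Files.createTempDirectory("job-output");
            Path tmp = out.resolve("_temporary");
            Files.createDirectories(tmp);

            // Original and speculative attempts of task 0 coexist safely,
            // because each writes to its own attempt-scoped path.
            Files.write(tmp.resolve("attempt_0_0"), "original".getBytes());
            Files.write(tmp.resolve("attempt_0_1"), "speculative".getBytes());

            // Commit phase: only the winning attempt is moved into place.
            Files.move(tmp.resolve("attempt_0_1"), out.resolve("part-00000"),
                       StandardCopyOption.ATOMIC_MOVE);
            return new String(Files.readAllBytes(out.resolve("part-00000")));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A direct output committer skips the temporary path and rename, which is exactly why concurrent or retried attempts end up fighting over the same final file.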
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)