[
https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744026#comment-14744026
]
Yin Huai commented on SPARK-9899:
---------------------------------
https://github.com/apache/spark/pull/8687 adds a warning message to the places
where we save data through the RDD API and where we save data to Hive, advising
against the use of a direct output committer when speculation is enabled. This
change will be included in 1.6.
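As a rough illustration (this is not the actual code from the PR above), the check behind such a warning boils down to detecting the unsafe combination of settings; the config key names below are illustrative assumptions, not necessarily the exact keys the PR inspects:

```java
import java.util.Map;

public class DirectCommitterCheck {
    // Hypothetical sketch: returns true when speculation is enabled while a
    // direct output committer appears to be configured (the unsafe combination).
    public static boolean shouldWarn(Map<String, String> conf) {
        boolean speculation =
            Boolean.parseBoolean(conf.getOrDefault("spark.speculation", "false"));
        String committer =
            conf.getOrDefault("spark.sql.sources.outputCommitterClass", "");
        return speculation && committer.toLowerCase().contains("direct");
    }
}
```

Note that this only warns; the job still runs, so users remain responsible for picking a safe committer when speculation is on.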
> JSON/Parquet writing on retry or speculation broken with direct output committer
> --------------------------------------------------------------------------------
>
> Key: SPARK-9899
> URL: https://issues.apache.org/jira/browse/SPARK-9899
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Michael Armbrust
> Assignee: Cheng Lian
> Priority: Blocker
> Fix For: 1.5.0
>
>
> If the first task fails, all subsequent tasks will fail as well. We probably
> need to set a different boolean when calling create.
> {code}
> java.io.IOException: File already exists: ...
> ...
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
> at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128)
> at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.<init>(JSONRelation.scala:185)
> at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160)
> at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> {code}
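The failure mode in the stack trace can be reproduced in miniature with plain `java.nio.file`: a non-overwriting create (analogous to the `FileSystem.create` call above) throws once an earlier failed or speculative attempt has already written the same output path. This is only an illustration of the pattern, not Spark or Hadoop code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RetryCollision {
    // Returns true if the second (retried) create of the same path fails,
    // mirroring the "File already exists" IOException in the stack trace.
    public static boolean retryFails() {
        try {
            Path dir = Files.createTempDirectory("spark-demo");
            Path part = dir.resolve("part-00000");
            Files.createFile(part);  // first (failed/speculative) attempt wrote the file
            Files.createFile(part);  // the retried attempt targets the same path...
            return false;
        } catch (FileAlreadyExistsException e) {
            return true;             // ...and collides with the leftover output
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```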
> The root cause is that speculation should not be used together with a direct
> output committer, as there are multiple corner cases in which this combination
> may cause data corruption and/or data loss. Please refer to this
> [GitHub
> comment|https://github.com/apache/spark/pull/8191#issuecomment-131598385] for
> more details about these corner cases.
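For contrast, the reason a conventional two-phase committer tolerates retries and speculation can be sketched with plain file operations: each task attempt writes to its own temporary path, so concurrent attempts never collide, and only the attempt that wins the commit is renamed into the final output location. The directory layout below is a simplified assumption, not the exact Hadoop `FileOutputCommitter` layout:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TwoPhaseCommitSketch {
    // Returns the contents of the final output file after committing the
    // winning attempt; the losing attempt's output is simply never promoted.
    public static String run() {
        try {
            Path out = Files.createTempDirectory("job-output");
            Path tmp = out.resolve("_temporary");
            Files.createDirectories(tmp);

            // Original and speculative attempts of task 0 coexist safely,
            // because each writes to its own attempt-scoped path.
            Files.write(tmp.resolve("attempt_0_0"), "original".getBytes());
            Files.write(tmp.resolve("attempt_0_1"), "speculative".getBytes());

            // Commit phase: only the winning attempt is moved into place.
            Files.move(tmp.resolve("attempt_0_1"), out.resolve("part-00000"),
                       StandardCopyOption.ATOMIC_MOVE);
            return new String(Files.readAllBytes(out.resolve("part-00000")));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A direct output committer skips the temporary path and rename, which is exactly why concurrent or retried attempts end up fighting over the same final file.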
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)