[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165649#comment-17165649 ]
Apache Spark commented on SPARK-27194:
--------------------------------------
User 'WinkerDu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29260
> Job failures when task attempts do not clean up spark-staging parquet files
> ---------------------------------------------------------------------------
>
> Key: SPARK-27194
> URL: https://issues.apache.org/jira/browse/SPARK-27194
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.3.1, 2.3.2, 2.3.3
> Reporter: Reza Safi
> Priority: Major
>
> When a container fails for some reason (for example, when killed by YARN for
> exceeding memory limits), the subsequent task attempts for the tasks that
> were running on that container all fail with a FileAlreadyExistsException.
> The original task attempt does not seem to successfully call abortTask (or at
> least its "best effort" delete is unsuccessful), so it never cleans up the
> parquet file it was writing. When later task attempts then write to the same
> spark-staging directory using the same file name, the job fails.
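> To make the failure mode concrete, here is a minimal sketch using plain Hadoop
> FileSystem calls. This is not Spark's writer code; the staging path and variable
> names are made up for illustration. The second create() refuses to overwrite the
> file left behind by the first attempt:
> {code}
> // Minimal sketch of the failure mode (not Spark's writer code); assumes a
> // Hadoop-compatible FileSystem and uses a made-up staging path.
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> val fs = FileSystem.get(new Configuration())
> // Dynamic partition overwrite writes task output under a hidden
> // .spark-staging-<jobId> directory with a deterministic part-file name.
> val staged = new Path(
>   "/tmp/table/.spark-staging-demo/dt=2019-02-17/part-00200-demo.c000.snappy.parquet")
>
> val attempt0 = fs.create(staged, /* overwrite = */ false)
> attempt0.close()
> // Suppose attempt 0's container is killed and abortTask never deletes `staged`.
>
> // Attempt 1 writes the same file name, so create(..., overwrite = false) throws
> // a FileAlreadyExistsException (as in the logs below) instead of succeeding.
> val attempt1 = fs.create(staged, false)
> attempt1.close()
> {code}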
> Here is what transpires in the logs:
> The container where task 200.0 is running is killed and the task is lost:
> {code}
> 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
> The task is re-attempted on a different executor and fails because the
> part-00200-blah-blah.c000.snappy.parquet file from the first task attempt
> already exists:
> {code}
> 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task failed while writing rows.
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet for client a.b.c.d already exists
> {code}
> The job fails once the configured number of task attempts (spark.task.maxFailures)
> have all failed with the same error:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task failed while writing rows.
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
> ...
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet for client i.p.a.d already exists
> {code}
> SPARK-26682 wasn't the root cause here, since there wasn't any stage
> reattempt.
> This issue seems to happen when
> spark.sql.sources.partitionOverwriteMode=dynamic.
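> For reference, here is a hedged sketch of the kind of write that exercises this
> code path. It assumes tmp_supply_feb1 already exists as a Hive table partitioned
> by dt (as the staging path above suggests); the DataFrame contents are purely
> illustrative:
> {code}
> // Illustrative only: a dynamic partition overwrite routes each task's output
> // through a hidden .spark-staging-<jobId> directory before job commit.
> // Assumes tmp_supply_feb1 already exists as a Hive table partitioned by dt.
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("dynamic-overwrite-sketch")
>   .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
>   .enableHiveSupport()
>   .getOrCreate()
> import spark.implicits._
>
> // Only the partitions present in this DataFrame are overwritten.
> val df = Seq((1L, "2019-02-17"), (2L, "2019-02-18")).toDF("id", "dt")
>
> df.write
>   .mode("overwrite")
>   .insertInto("tmp_supply_feb1")
> {code}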
>