[ https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215091#comment-17215091 ]
James Yu commented on HADOOP-17201:
-----------------------------------
We hit this "never-ending last task" issue frequently as well, except that we
use the default s3a committer rather than the magic committer. When it happens,
the stuck executors usually show a stack trace (see below) similar to the one
reported in this ticket. We also verified there was no significant data skew
when this happened.
{code:java}
s3a-transfer-shared-pool3-t5 (TIMED_WAITING)
java.lang.Thread.sleep(Native Method)
org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1416)
org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1676)
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2738)
org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2704)
org.apache.hadoop.fs.s3a.S3AFileSystem.putObjectDirect(S3AFileSystem.java:1548)
org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$putObject$5(WriteOperationHelper.java:430)
org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$1846/182096913.execute(Unknown Source)
org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
org.apache.hadoop.fs.s3a.Invoker$$Lambda$466/1761475153.execute(Unknown Source)
org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:236)
org.apache.hadoop.fs.s3a.WriteOperationHelper.retry(WriteOperationHelper.java:123)
org.apache.hadoop.fs.s3a.WriteOperationHelper.putObject(WriteOperationHelper.java:428)
org.apache.hadoop.fs.s3a.S3ABlockOutputStream.lambda$putObject$0(S3ABlockOutputStream.java:438)
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$$Lambda$1845/1094082085.call(Unknown Source)
...{code}
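In both our trace and the one in this ticket, the task thread is asleep inside Invoker.retryUntranslated() while deleteUnnecessaryFakeDirectories() keeps retrying a failing DeleteObjects call, so the task is not doing work, it is waiting out the retry policy. As a rough back-of-the-envelope illustration of how long such a wait can become under exponential backoff (the attempt count and base interval below are assumptions for illustration, not the actual S3A defaults, and the real S3ARetryPolicy adds randomisation):
{code:java}
// Back-of-the-envelope sketch only: total sleep of a retry loop where each
// attempt doubles the previous pause. The numbers below are assumptions for
// illustration, NOT the actual S3A defaults.
public class RetryBackoffEstimate {
  public static void main(String[] args) {
    long baseIntervalMs = 500;   // assumed initial pause between retries
    int attempts = 20;           // assumed number of retries before giving up
    long totalMs = 0;
    long pause = baseIntervalMs;
    for (int i = 0; i < attempts; i++) {
      totalMs += pause;          // accumulate the sleep for this attempt
      pause *= 2;                // doubling backoff for the next attempt
    }
    System.out.printf("Worst-case cumulative sleep: %.1f hours%n",
        totalMs / 3_600_000.0);  // roughly 145 hours for these numbers
  }
}
{code}
Even a fraction of that is enough for a task to look permanently stuck rather than failing.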
We are still unable to identify the true root cause of this issue; it does not
appear to be specific to any particular flavor of the s3a committers.
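Until the underlying DeleteObjects failure is understood, one possible mitigation (our assumption, not something from this ticket) is to tighten the S3A retry settings so the delete fails fast instead of sleeping for hours. A minimal sketch using the standard fs.s3a.* options; the values are illustrative, not recommendations:
{code:java}
// Hypothetical mitigation sketch, not a fix from this ticket: tighten the S3A
// retry policy so a failing DeleteObjects inside finishedWrite() gives up
// quickly. Option names are standard fs.s3a.* keys; values are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import java.net.URI;

public class S3ARetryTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("fs.s3a.retry.limit", 3);              // retries per S3A operation
    conf.set("fs.s3a.retry.interval", "500ms");        // pause between those retries
    conf.setInt("fs.s3a.retry.throttle.limit", 5);     // retries on throttled (503) responses
    conf.set("fs.s3a.retry.throttle.interval", "1s");  // initial throttle backoff
    conf.setInt("fs.s3a.attempts.maximum", 5);         // retries inside the AWS SDK client
    // "my-bucket" is a placeholder; connecting requires real credentials.
    FileSystem fs = FileSystem.get(new URI("s3a://my-bucket/"), conf);
    System.out.println("Using " + fs.getUri() + " with a tightened retry policy");
  }
}
{code}
In Spark the same keys can be passed as spark.hadoop.fs.s3a.* properties, the same way the committer is configured in the environment section below.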
> Spark job with s3acommitter stuck at the last stage
> ---------------------------------------------------
>
> Key: HADOOP-17201
> URL: https://issues.apache.org/jira/browse/HADOOP-17201
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 3.2.1
> Environment: We are on Spark 2.4.5 / Hadoop 3.2.1 with the s3a committer.
> spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
> spark.hadoop.fs.s3a.committer.name: magic
> Reporter: Dyno
> Priority: Major
> Attachments: exec-120.log, exec-125.log, exec-25.log, exec-31.log,
> exec-36.log, exec-44.log, exec-5.log, exec-64.log, exec-7.log
>
>
> Our Spark job usually takes 1 to 2 hours to finish. Occasionally it runs for
> more than 3 hours, at which point we know it is stuck, and the stuck executor
> usually has a stack like this:
> {{
> "Executor task launch worker for task 78620" #265 daemon prio=5 os_prio=0
> tid=0x00007f73e0005000 nid=0x12d waiting on condition [0x00007f74cb291000]
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
> at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
> at
> org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
> at
> org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
> at
> org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown
> Source)
> at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
> at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
> at
> org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
> at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
> at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
> at
> org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
> at
> org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
> at
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
> at
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
> at
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
> at
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
> at
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
> at
> org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
> at
> org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
> at
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
> at
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> at
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
> at
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
> at
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
> - <0x00000003a57332e0> (a
> java.util.concurrent.ThreadPoolExecutor$Worker)
> }}
> We captured jstack output on the stuck executors in case it's useful.