[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132746#comment-17132746 ]

Dyno commented on HADOOP-17063:
-------------------------------

> important: you are using the classic FileOutputCommitter; this is slow and unsafe on s3. Can you move to the S3a zero rename committer?

{noformat}
fs.s3a.committer.name   Committer
directory               directory staging committer
partitioned             partition staging committer (for use in Spark only)
magic                   the "magic" committer
file                    the original and unsafe File committer (default)
{noformat}

Our setup is Kubernetes / Spark 2.4.4 / Hadoop 3.2.1. According to
https://hadoop.apache.org/docs/r3.2.1/hadoop-aws/tools/hadoop-aws/committers.html and
https://github.com/aws-samples/eks-spark-benchmark/blob/master/performance/s3.md,
the directory/partitioned committers need shared storage and the magic committer is only supported in Spark 3.0, so I think the only option for us is "file". When you say "S3a zero rename committer", do you mean the directory one?


> S3ABlockOutputStream.putObject looks stuck and never timeout
> ------------------------------------------------------------
>
>                 Key: HADOOP-17063
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17063
>             Project: Hadoop Common
>          Issue Type: Sub-task
>    Affects Versions: 3.2.1
>        Environment: hadoop 3.2.1
>                     spark 2.4.4
>           Reporter: Dyno
>           Priority: Minor
>        Attachments: jstack_exec-34.log, jstack_exec-40.log, jstack_exec-74.log
>
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:523)
> com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:82)
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.putObject(S3ABlockOutputStream.java:446)
> org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:365)
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
> org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
> org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> org.apache.spark.scheduler.Task.run(Task.scala:123)
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
>
> we are using spark 2.4.4 with hadoop 3.2.1 on kubernetes/spark-operator; sometimes we see this hang with the stacktrace above. it looks like putObject never returns; we have to kill the executor to make the job move forward.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
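For reference, if the directory committer does turn out to be usable, the Spark-side wiring documented for the S3A committers looks roughly like the sketch below. This is an illustrative fragment, not a verified configuration: the property names are taken from the hadoop-aws committer documentation, and it assumes the spark-hadoop-cloud committer-binding classes are on the classpath, which stock ASF Spark 2.4 does not ship.

{code}
# spark-defaults.conf (illustrative sketch, assuming spark-hadoop-cloud is present)
spark.hadoop.fs.s3a.committer.name                   directory
spark.hadoop.fs.s3a.committer.staging.conflict-mode  append
spark.sql.sources.commitProtocolClass                org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class             org.apache.spark.internal.io.cloud.BinaryParquetOutputCommitter
{code}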
[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132581#comment-17132581 ]

Steve Loughran commented on HADOOP-17063:
-----------------------------------------

stack 40 is the same; stack 34 is the one blocking in close() waiting for a task to complete, again in a delete-objects call.

{code}
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
at org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
at org.apache.hadoop.fs.s3a.S3AFileSystem.putObjectDirect(S3AFileSystem.java:1589)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$putObject$5(WriteOperationHelper.java:430)
at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$183/1533480417.execute(Unknown Source)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
at org.apache.hadoop.fs.s3a.Invoker$$Lambda$22/633457182.execute(Unknown Source)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:236)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.retry(WriteOperationHelper
{code}

Looks like a retry around a retry (bad); the inner call may fail, and the retry count is taking forever.

Hypothesis: something is happening in that delete-objects call (permissions?). Set the s3a retry count (I forget its name) to 1 and see if things now fail fast;
then we can look at the underlying problem.
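The retry option whose name Steve could not recall is presumably {{fs.s3a.retry.limit}} (the S3A retry count in Hadoop 3.1+); {{fs.s3a.attempts.maximum}} additionally caps AWS SDK-level retries. A hedged fail-fast sketch for diagnosis, values illustrative:

{code}
<!-- core-site.xml fragment: make S3A fail fast instead of retrying for ages -->
<property>
  <name>fs.s3a.retry.limit</name>
  <value>1</value>
</property>
<property>
  <name>fs.s3a.attempts.maximum</name>
  <value>1</value>
</property>
{code}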
[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132574#comment-17132574 ]

Steve Loughran commented on HADOOP-17063:
-----------------------------------------

jstack 74: delete objects is blocking; maybe something is waiting for an HTTP connection.

More important: you are using the classic FileOutputCommitter; this is slow and unsafe on s3. Can you move to the S3a zero rename committer?
[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132571#comment-17132571 ]

Steve Loughran commented on HADOOP-17063:
-----------------------------------------

thanks for the stack. Regarding the test instructions, I meant "try those settings for your own job".
[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129802#comment-17129802 ]

Dyno commented on HADOOP-17063:
-------------------------------

It happened again; I attached the jstack. Thanks for looking into it. I was trying to implement the change you suggested, but the test instructions do not look quite clear.
[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125838#comment-17125838 ]

Steve Loughran commented on HADOOP-17063:
-----------------------------------------

note: if you supply a patch to S3ABlockOutputStream.putObject() to just do the upload in the current thread, as a GitHub PR, and run the hadoop-aws integration tests as we document (https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md), I'll merge it in and backport it to 3.2.x. That bit of code hasn't changed for years, so backporting will be easy. And I won't expect a new test, because there's little that can be added.
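The pattern under discussion, submitting the PUT to a shared bounded pool and then immediately blocking on the future versus simply running it in the calling thread, can be sketched generically. This is an illustration of the idea only: {{InlinePutSketch}} and its {{putObject}} stand-in are hypothetical names, not the real S3ABlockOutputStream code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InlinePutSketch {

    // Hypothetical stand-in for the single PUT request; not the real S3A call.
    static String putObject(String key) {
        return "etag-for-" + key;
    }

    // Pattern in the current code: submit to a bounded shared pool, then
    // immediately block on the future. If other writers have exhausted the
    // pool, the caller can sit waiting with no timeout -- the hang reported
    // in this issue.
    static String putViaPool(ExecutorService pool, String key) throws Exception {
        Future<String> result = pool.submit(() -> putObject(key));
        return result.get();   // close() blocks here indefinitely
    }

    // Proposed pattern: since the caller waits for the result immediately,
    // the pool buys nothing for a single PUT; running it inline removes the
    // deadlock window entirely.
    static String putInline(String key) {
        return putObject(key);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        String a = putViaPool(pool, "part-0000");
        String b = putInline("part-0000");
        pool.shutdown();
        System.out.println(a.equals(b));   // identical result either way
    }
}
```

The multipart-upload path is different: there the pool genuinely overlaps block uploads with data generation, which is why only the single-PUT case is a candidate for inlining.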
[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125823#comment-17125823 ]

Steve Loughran commented on HADOOP-17063:
-----------------------------------------

Looking at the relevant code, we are scheduling the PUT operation into a semaphore-guarded bounded thread pool. If it is deadlocking, then it may be that something else has exhausted that thread pool and is blocking.

Also looking at the relevant code: we don't need to execute the PUT in a separate thread, because straight after submitting we wait for the result. It is different for big files, where we do multipart uploads; there we do want to upload data blocks while the worker threads generate more data.

Try setting fs.s3a.max.total.tasks to something big; the default is 32. Try 2x the number of workers you have, and update fs.s3a.connection.maximum to match. And see https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/bk_cloud-data-access/content/spark-perf.html for a broader set of suggestions, but do know that ASF Spark doesn't ship with the s3a committer integration. Your job commits will be slow and potentially at risk of failing intermittently with 404s on rename.

If you do see the problem again, can you get a full jstack thread dump and attach it to this JIRA? thanks.
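As a concrete sketch of that tuning advice: with, say, 32 worker threads per executor, 2x would be 64. The values below are illustrative only, not recommendations from this thread.

{code}
# spark-defaults.conf fragment (illustrative values: 2x an assumed 32 workers)
spark.hadoop.fs.s3a.max.total.tasks      64
spark.hadoop.fs.s3a.connection.maximum   64
{code}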
[jira] [Commented] (HADOOP-17063) S3ABlockOutputStream.putObject looks stuck and never timeout
[ https://issues.apache.org/jira/browse/HADOOP-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125413#comment-17125413 ]

Dyno commented on HADOOP-17063:
-------------------------------

Log in the Spark executor:

{noformat}
2020-06-03 22:57:23,032 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
2020-06-03 22:57:23,032 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
2020-06-03 22:57:23,032 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
2020-06-03 22:57:22,973 INFO util.ShutdownHookManager: Deleting directory /var/data/spark-ff208630-1fc7-48bc-93a1-6bdf94921c64/spark-eb4613f4-a41a-4985-845b-34b58ae95c50
2020-06-03 22:57:22,972 INFO util.ShutdownHookManager: Shutdown hook called
2020-06-03 22:57:22,964 INFO storage.DiskBlockManager: Shutdown hook called
2020-06-03 22:57:22,960 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM   <-- kill to make it move forward
2020-06-03 21:52:40,549 INFO executor.Executor: Finished task 2927.0 in stage 12.0 (TID 31771). 4696 bytes result sent to driver
2020-06-03 21:52:40,541 INFO output.FileOutputCommitter: Saved output of task 'attempt_20200603213232_0012_m_002927_31771' to s3a://com
2020-06-03 21:52:40,541 INFO mapred.SparkHadoopMapRedUtil: attempt_20200603213232_0012_m_002927_31771: Committed
2020-06-03 21:52:34,971 INFO executor.Executor: Finished task 2922.0 in stage 12.0 (TID 31766). 4696 bytes result sent to driver
2020-06-03 21:52:34,962 INFO output.FileOutputCommitter: Saved out ...
{noformat}