[ https://issues.apache.org/jira/browse/HADOOP-19479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951137#comment-17951137 ]

Sarunas Valaskevicius commented on HADOOP-19479:
------------------------------------------------

Hi, is there a workaround we could use? E.g. can we adjust the timeout so that it does not close the stream and trigger the deadlock? While the risk of S3 (or the network routes to it) being unavailable is fairly low, the impact here is high: the whole system freezes and does not recover even after the network recovers.

> S3A Deadlock in multipart upload
> --------------------------------
>
>                 Key: HADOOP-19479
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19479
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.4.1
>            Reporter: Sarunas Valaskevicius
>            Priority: Major
>
> Reproduced while testing system resilience by turning off network access to S3 (introducing a network partition covering the IP addresses S3 uses). Given that the stack traces appear to be timer-related, it could presumably happen at any time.
> {code:java}
> Found one Java-level deadlock:
> =============================
> "sdk-ScheduledExecutor-2-3":
>   waiting to lock monitor 0x00007f5c880a8630 (object 0x0000000315523c78, a java.lang.Object),
>   which is held by "sdk-ScheduledExecutor-2-4"
> "sdk-ScheduledExecutor-2-4":
>   waiting to lock monitor 0x00007f5c7c016700 (object 0x0000000327800000, a org.apache.hadoop.fs.s3a.S3ABlockOutputStream),
>   which is held by "io-compute-blocker-15"
> "io-compute-blocker-15":
>   waiting to lock monitor 0x00007f5c642ae900 (object 0x00000003af0001d8, a java.lang.Object),
>   which is held by "sdk-ScheduledExecutor-2-3"
>
> Java stack information for the threads listed above:
> ===================================================
> "sdk-ScheduledExecutor-2-3":
>   at java.lang.Thread.interrupt(java.base@21/Thread.java:1717)
>   - waiting to lock <0x0000000315523c78> (a java.lang.Object)
>   at software.amazon.awssdk.core.internal.http.timers.SyncTimeoutTask.run(SyncTimeoutTask.java:60)
>   - locked <0x00000003af0001d8> (a java.lang.Object)
>   at java.util.concurrent.Executors$RunnableAdapter.call(java.base@21/Executors.java:572)
>   at java.util.concurrent.FutureTask.run(java.base@21/FutureTask.java:317)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@21/ScheduledThreadPoolExecutor.java:304)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21/ThreadPoolExecutor.java:1144)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21/ThreadPoolExecutor.java:642)
>   at java.lang.Thread.runWith(java.base@21/Thread.java:1596)
>   at java.lang.Thread.run(java.base@21/Thread.java:1583)
>
> "sdk-ScheduledExecutor-2-4":
>   at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.getActiveBlock(S3ABlockOutputStream.java:304)
>   - waiting to lock <0x0000000327800000> (a org.apache.hadoop.fs.s3a.S3ABlockOutputStream)
>   at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:485)
>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
>   at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>   at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:66)
>   at java.nio.channels.Channels$WritableByteChannelImpl.implCloseChannel(java.base@21/Channels.java:404)
>   at java.nio.channels.spi.AbstractInterruptibleChannel$1.interrupt(java.base@21/AbstractInterruptibleChannel.java:163)
>   - locked <0x00000003af0002a0> (a java.lang.Object)
>   at java.lang.Thread.interrupt(java.base@21/Thread.java:1722)
>   - locked <0x0000000315523c78> (a java.lang.Object)
>   at software.amazon.awssdk.core.internal.http.timers.SyncTimeoutTask.run(SyncTimeoutTask.java:60)
>   - locked <0x00000003af0002e0> (a java.lang.Object)
>   at java.util.concurrent.Executors$RunnableAdapter.call(java.base@21/Executors.java:572)
>   at java.util.concurrent.FutureTask.run(java.base@21/FutureTask.java:317)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@21/ScheduledThreadPoolExecutor.java:304)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21/ThreadPoolExecutor.java:1144)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21/ThreadPoolExecutor.java:642)
>   at java.lang.Thread.runWith(java.base@21/Thread.java:1596)
>   at java.lang.Thread.run(java.base@21/Thread.java:1583)
>
> "io-compute-blocker-15":
>   at software.amazon.awssdk.core.internal.http.timers.SyncTimeoutTask.cancel(SyncTimeoutTask.java:74)
>   - waiting to lock <0x00000003af0001d8> (a java.lang.Object)
>   at software.amazon.awssdk.core.internal.http.timers.ApiCallTimeoutTracker.cancel(ApiCallTimeoutTracker.java:53)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:77)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
>   at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
>   at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
>   at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
>   at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
>   at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
>   at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
>   at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
>   at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
>   at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
>   at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
>   at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler$$Lambda/0x00007f5d2cb20ca8.get(Unknown Source)
>   at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
>   at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
>   at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
>   at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
>   at software.amazon.awssdk.services.s3.DefaultS3Client.createMultipartUpload(DefaultS3Client.java:1463)
>   at software.amazon.awssdk.services.s3.DelegatingS3Client.lambda$createMultipartUpload$4(DelegatingS3Client.java:1232)
>   at software.amazon.awssdk.services.s3.DelegatingS3Client$$Lambda/0x00007f5d2d316118.apply(Unknown Source)
>   at software.amazon.awssdk.services.s3.internal.crossregion.S3CrossRegionSyncClient.invokeOperation(S3CrossRegionSyncClient.java:67)
>   at software.amazon.awssdk.services.s3.DelegatingS3Client.createMultipartUpload(DelegatingS3Client.java:1232)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$initiateMultipartUpload$30(S3AFileSystem.java:4705)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem$$Lambda/0x00007f5d2d315ef8.get(Unknown Source)
>   at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfSupplier(IOStatisticsBinding.java:651)
>   at org.apache.hadoop.fs.s3a.S3AFileSystem.initiateMultipartUpload(S3AFileSystem.java:4703)
>   at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$initiateMultiPartUpload$0(WriteOperationHelper.java:283)
>   at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda/0x00007f5d2d30e230.apply(Unknown Source)
>   at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
>   at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:376)
>   at org.apache.hadoop.fs.s3a.Invoker$$Lambda/0x00007f5d2d2dd6a0.apply(Unknown Source)
>   at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
>   at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:372)
>   at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:347)
>   at org.apache.hadoop.fs.s3a.WriteOperationHelper.retry(WriteOperationHelper.java:207)
>   at org.apache.hadoop.fs.s3a.WriteOperationHelper.initiateMultiPartUpload(WriteOperationHelper.java:278)
>   at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.lambda$new$0(S3ABlockOutputStream.java:904)
>   at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload$$Lambda/0x00007f5d2d30e000.apply(Unknown Source)
>   at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
>   at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
>   at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding$$Lambda/0x00007f5d2ca3c918.apply(Unknown Source)
>   at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449)
>   at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.<init>(S3ABlockOutputStream.java:902)
>   at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.initMultipartUpload(S3ABlockOutputStream.java:462)
>   at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.uploadCurrentBlock(S3ABlockOutputStream.java:439)
>   - locked <0x0000000327800000> (a org.apache.hadoop.fs.s3a.S3ABlockOutputStream)
>   at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.write(S3ABlockOutputStream.java:413)
>   - locked <0x0000000327800000> (a org.apache.hadoop.fs.s3a.S3ABlockOutputStream)
>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
>   at java.io.DataOutputStream.write(java.base@21/DataOutputStream.java:115)
>   - locked <0x0000000327800208> (a org.apache.hadoop.fs.FSDataOutputStream)
>   at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:50)
>   at java.nio.channels.Channels$WritableByteChannelImpl.write(java.base@21/Channels.java:392)
>   - locked <0x00000003afab3da8> (a java.lang.Object)
>   at org.apache.parquet.bytes.ConcatenatingByteBufferCollector.writeAllTo(ConcatenatingByteBufferCollector.java:77)
>   at org.apache.parquet.hadoop.ParquetFileWriter.writeColumnChunk(ParquetFileWriter.java:1338)
>   at org.apache.parquet.hadoop.ParquetFileWriter.writeColumnChunk(ParquetFileWriter.java:1259)
>   at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:408)
>   at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:675)
>   at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:210)
>   at org.apache.parquet.hadoop.InternalParquetRecordWriter.checkBlockSizeReached(InternalParquetRecordWriter.java:178)
>   at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:154)
>   at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:428)
>
> Found 1 deadlock.
> {code}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
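One possible mitigation to experiment with (an untested sketch, not a confirmed fix): the interrupt in the trace comes from the AWS SDK's `SyncTimeoutTask`, i.e. the SDK-side API-call timeout, which S3A wires up from `fs.s3a.connection.request.timeout`. Setting that option to `0` disables the SDK request timeout, so the timer task that interrupts the writing thread should never fire. The trade-off is that a hung request would then block until a lower-level socket timeout triggers, so this only masks the deadlock rather than fixing it:

{code:xml}
<!-- core-site.xml: disable the SDK-side request timeout (assumption: this
     prevents the SyncTimeoutTask interrupt seen in the deadlock trace). -->
<property>
  <name>fs.s3a.connection.request.timeout</name>
  <!-- 0 = no request timeout; rely on fs.s3a.connection.timeout /
       fs.s3a.connection.establish.timeout for network-level failures. -->
  <value>0</value>
</property>
{code}

Whether this actually avoids the lock ordering above would need to be verified under the same network-partition test.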