[
https://issues.apache.org/jira/browse/FLINK-32230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728452#comment-17728452
]
Ahmed Hamdy commented on FLINK-32230:
-------------------------------------
Hi [~antoniovespoli], thanks for reporting the issue.
I will be looking into it as early as next week. As you mentioned, this form of
deadlock could be caused by silent failures of the SDK client when submitting
the records.
While I understand this is a hard case to reproduce, it would be great if you
could update the ticket with any information observed in operation.
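For context, a minimal sketch of that failure mode from the caller's side (a simplified stand-in, not the connector or SDK code): if the future returned by the async client is never completed, the completion callback that would release the in-flight request never runs, and a flush that waits for in-flight requests has nothing left to unblock it.
{code:java}
// Simplified stand-in, not the connector code: the writer releases an
// in-flight slot only from the completion callback of the SDK future, so a
// future that never completes (a "silent failure") stalls flush() forever.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class SilentSdkFailureSketch {

    private final AtomicInteger inFlightRequests = new AtomicInteger();

    /** Stand-in for an async client call: here the returned future is never completed. */
    private CompletableFuture<Void> submitBatch() {
        return new CompletableFuture<>(); // neither a response nor an exception is ever delivered
    }

    void writeBatch() {
        inFlightRequests.incrementAndGet();
        submitBatch().whenComplete((response, error) -> {
            // The only place the in-flight slot is released; never reached here.
            inFlightRequests.decrementAndGet();
        });
    }

    /** Mirrors the flush-before-checkpoint behaviour: wait until nothing is in flight. */
    void flush() throws InterruptedException {
        while (inFlightRequests.get() > 0) {
            Thread.sleep(10); // no progress possible: only the callback above can change the counter
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SilentSdkFailureSketch sink = new SilentSdkFailureSketch();
        sink.writeBatch();
        sink.flush(); // hangs indefinitely, matching the reported behaviour
    }
}
{code}
In the real connector the wait is on the task mailbox rather than a sleep loop, but the effect is the same: only the completion callback of the SDK call can bring the in-flight count back to zero.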
> Deadlock in AWS Kinesis Data Streams AsyncSink connector
> --------------------------------------------------------
>
> Key: FLINK-32230
> URL: https://issues.apache.org/jira/browse/FLINK-32230
> Project: Flink
> Issue Type: Bug
> Components: Connectors / AWS
> Affects Versions: 1.15.4, 1.16.2, 1.17.1
> Reporter: Antonio Vespoli
> Priority: Major
> Fix For: aws-connector-3.1.0, aws-connector-4.2.0
>
>
> Connector calls to AWS Kinesis Data Streams can hang indefinitely without
> making any progress.
> We suspect the root cause to be related to the SDK handling of exceptions,
> similar to what was observed in FLINK-31675.
> We identified this deadlock on applications running on AWS Kinesis Data
> Analytics using the AWS Kinesis Data Streams AsyncSink (with AWS SDK version
> 2.20.32 as per FLINK-31675). The deadlock scenario is still the same as
> described in FLINK-31675. However, the Netty content-length exception does
> not appear when using the updated SDK version.
> This issue only occurs for applications and streams in the AWS regions
> _ap-northeast-3_ and _us-gov-east-1_. We did not observe this issue in
> any other AWS region.
> The issue happens sporadically and unpredictably. Given its nature, we do
> not have steps to reproduce it.
> Example of flame-graphs observed when the issue occurs:
> {code:java}
> root
> java.lang.Thread.run:829
> org.apache.flink.runtime.taskmanager.Task.run:568
> org.apache.flink.runtime.taskmanager.Task.doRun:746
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke:932
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring:953
> org.apache.flink.runtime.taskmanager.Task$$Lambda$1253/0x0000000800ecbc40.run:-1
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke:753
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop:804
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop:203
> org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$907/0x0000000800bf7840.runDefaultAction:-1
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput:519
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput:65
> org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext:110
> org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext:159
> org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent:181
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier:231
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState:262
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$$Lambda$1586/0x00000008012c5c40.apply:-1
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2:234
> org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived:66
> org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint:74
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint:493
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100:64
> org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint:287
> org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint:147
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier:1198
> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint:1241
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing:50
> org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1453/0x000000080128e840.run:-1
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$12:1253
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState:300
> org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.prepareSnapshotPreBarrier:89
> org.apache.flink.streaming.runtime.operators.sink.SinkWriterOperator.prepareSnapshotPreBarrier:165
> org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.flush:494
> org.apache.flink.connector.base.sink.writer.AsyncSinkWriter.yieldIfThereExistsInFlightRequests:503
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxExecutorImpl.yield:84
> org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take:149
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await:2211
> java.util.concurrent.locks.LockSupport.parkNanos:234
> jdk.internal.misc.Unsafe.park:-2
> {code}
>
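> The bottom frames show the flush path parked on the task mailbox. A simplified
> illustration of that wait (a hypothetical stand-in, not the actual
> AsyncSinkWriter or mailbox code): the task thread blocks in take() until some
> mail arrives, and the only mail that would let flush() return is the one
> posted when the SDK request completes.
> {code:java}
> // Hypothetical stand-in for the bottom of the stack above, not the actual
> // AsyncSinkWriter/TaskMailbox code. flush() keeps taking from the mailbox
> // while requests are in flight; take() parks the task thread (the
> // ConditionObject.await frame) until a new mail arrives. If the SDK request
> // never completes, onRequestCompleted() is never called, no mail is ever
> // posted, and the thread stays parked.
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.LinkedBlockingQueue;
> import java.util.concurrent.atomic.AtomicInteger;
>
> public class MailboxWaitSketch {
>
>     private final BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
>     private final AtomicInteger inFlightRequests = new AtomicInteger(1); // one batch already submitted
>
>     /** Mirrors flush() waiting on in-flight requests: process mails until the count drops to zero. */
>     void flush() throws InterruptedException {
>         while (inFlightRequests.get() > 0) {
>             mailbox.take().run(); // parks here indefinitely if no mail is ever enqueued
>         }
>     }
>
>     /** Would be enqueued from the SDK future's completion callback; never invoked when the future stays pending. */
>     void onRequestCompleted() {
>         mailbox.add(inFlightRequests::decrementAndGet);
>     }
>
>     public static void main(String[] args) throws InterruptedException {
>         new MailboxWaitSketch().flush(); // never returns: the completion mail is never posted
>     }
> }
> {code}
>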
--
This message was sent by Atlassian Jira
(v8.20.10#820010)