[
https://issues.apache.org/jira/browse/FLINK-29545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614666#comment-17614666
]
xiaogang zhou edited comment on FLINK-29545 at 10/10/22 2:01 AM:
-----------------------------------------------------------------
1. Yes, I have debugged this job many times, and every time the consumer stops
when a checkpoint is triggered. This is an EXACTLY_ONCE job. I also tested with
AT_LEAST_ONCE, and the problem did not appear.
2. I don't think the processor is blocked at logCheckpointProcessingDelay. I
mentioned it because some subtasks succeed and display a checkpoint duration,
while others only show n/a (see the attached picture). I found that the
successful subtasks call
SubtaskCheckpointCoordinatorImpl#checkpointState at the source task in the DAG,
but the 'n/a' subtasks only call
StreamTask#triggerCheckpointAsync.
I am not sure why 'checkpointState' was not run by the mailbox executor.
Also, I have 500 TaskManagers, so it is hard to decide which one's thread stack
I should dump.
[~masteryhx]
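One way to narrow down which of the 500 TaskManagers to dump might be to look
only at the hosts that run the 'n/a' subtasks. The following is a minimal
sketch, not Flink code: the class DumpTargets, the method hostsToDump, the two
maps, and the host names are all hypothetical, and the values are assumed to
be copied by hand from the checkpoint detail page.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class DumpTargets {

    /**
     * Returns the hosts that run at least one subtask whose checkpoint
     * acknowledgement is shown as "n/a". Both maps are keyed by subtask index.
     * (Hypothetical helper; the inputs are assumed to be collected manually.)
     */
    static SortedSet<String> hostsToDump(Map<Integer, String> ackBySubtask,
                                         Map<Integer, String> hostBySubtask) {
        SortedSet<String> hosts = new TreeSet<>();
        for (Map.Entry<Integer, String> e : ackBySubtask.entrySet()) {
            if ("n/a".equals(e.getValue())) {
                String host = hostBySubtask.get(e.getKey());
                if (host != null) {
                    hosts.add(host); // the set removes duplicate hosts
                }
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Made-up sample values, only for illustration.
        Map<Integer, String> ack = new HashMap<>();
        ack.put(0, "1s");
        ack.put(1, "n/a");
        ack.put(2, "n/a");
        ack.put(3, "2s");

        Map<Integer, String> host = new HashMap<>();
        host.put(0, "tm-a");
        host.put(1, "tm-b");
        host.put(2, "tm-c");
        host.put(3, "tm-a");

        System.out.println(hostsToDump(ack, host));
    }
}
```

A thread dump (for example with jstack) would then only be needed on the
returned hosts, not on all 500.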
> kafka consuming stop when trigger first checkpoint
> --------------------------------------------------
>
> Key: FLINK-29545
> URL: https://issues.apache.org/jira/browse/FLINK-29545
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / Network
> Affects Versions: 1.13.3
> Reporter: xiaogang zhou
> Priority: Critical
> Attachments: backpressure 100 busy 0.png, task acknowledge na.png,
> task dag.png
>
>
> The task DAG is shown in the attached file. When the job starts consuming from
> the earliest offset, it stops when the first checkpoint is triggered.
>
> Is this normal? The sink shows 0% busy and the second operator shows 100% backpressure.
>
> In the checkpoint summary, some of the subtasks show n/a.
> I tried to debug this issue and found that, in
> triggerCheckpointAsync,
> triggerCheckpointAsyncInMailbox took a long time to be called.
>
>
> This looks like it has something to do with
> logCheckpointProcessingDelay. Is there any fix for this issue?
>
>
> Can anybody help me with this issue?
>
> Thanks