[
https://issues.apache.org/jira/browse/FLINK-29545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617613#comment-17617613
]
Piotr Nowojski edited comment on FLINK-29545 at 10/14/22 9:55 AM:
------------------------------------------------------------------
It looks like the problem is with the upstream sources. What might be happening
is that downstream subtasks are waiting for checkpoint barriers from those
remaining sources, which for some reason do not arrive; this blocks the
alignment process on the downstream subtasks and stalls the progress of the
whole job. I am not sure whether this state would be reported as backpressured.
Nevertheless, I would dig deeper into why, in this screenshot !task acknowledge
na.png|height=200!, those 4 subtasks haven't finished the checkpoint. You
could, for example, share a thread dump from a task manager that is running one
of those source subtasks (and tell us the name/subtask id of the problematic
subtask). Given that you have a custom source, you can also double-check
whether it is implemented correctly, especially when it comes to acquiring the
checkpoint lock and/or the {{CheckpointedFunction#snapshotState}} /
{{CheckpointedFunction#initializeState}} methods. I would expect some problem
with your implementation of that source. Could you maybe share the source code
of that source?
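
For reference, below is a minimal sketch of how a legacy {{SourceFunction}}
typically satisfies that contract: emit records and advance offsets only while
holding the checkpoint lock, release the lock promptly so that
{{snapshotState}} can run, and keep {{snapshotState}} / {{initializeState}}
free of blocking calls. The class, field, and {{fetchNextRecord}} names are
illustrative only, not taken from the reporter's code.

{code:java}
// Hypothetical custom source; names are illustrative, not the reporter's code.
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class MyCustomSource implements SourceFunction<String>, CheckpointedFunction {

    private volatile boolean running = true;
    private transient ListState<Long> offsetState;
    private long currentOffset;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            // Do blocking I/O outside the checkpoint lock.
            String record = fetchNextRecord();
            // Emit the record and advance the offset under the checkpoint lock,
            // then release it promptly so the checkpoint can be taken.
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(record);
                currentOffset++;
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Runs while the checkpoint lock is held; must not block on external systems.
        offsetState.clear();
        offsetState.add(currentOffset);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        offsetState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("offsets", Long.class));
        for (Long offset : offsetState.get()) {
            currentOffset = offset;
        }
    }

    private String fetchNextRecord() {
        // Placeholder for the actual fetch logic.
        return "record-" + currentOffset;
    }
}
{code}

If the run loop holds the checkpoint lock across blocking I/O, or if
{{snapshotState}} blocks on an external system, the snapshot cannot be taken
and the barrier is never emitted downstream, which would match the symptoms
above.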
> kafka consuming stop when trigger first checkpoint
> --------------------------------------------------
>
> Key: FLINK-29545
> URL: https://issues.apache.org/jira/browse/FLINK-29545
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / Network
> Affects Versions: 1.13.3
> Reporter: xiaogang zhou
> Priority: Critical
> Attachments: backpressure 100 busy 0.png, task acknowledge na.png,
> task dag.png
>
>
> The task DAG is shown in the attached file. The task starts consuming from
> the earliest offset, and it stops when the first checkpoint triggers.
>
> Is this normal? The sink shows busy 0 and the second operator shows 100
> backpressure.
>
> Checking the checkpoint summary, we can find that some of the subtasks are
> n/a. I tried to debug this issue and found that in triggerCheckpointAsync,
> triggerCheckpointAsyncInMailbox took a long time to be called.
>
>
> It looks like this has something to do with logCheckpointProcessingDelay.
> Is there any fix for this issue?
>
>
> Can anybody help me with this issue?
>
> Thanks
--
This message was sent by Atlassian Jira
(v8.20.10#820010)