[
https://issues.apache.org/jira/browse/FLINK-28030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yun Tang closed FLINK-28030.
----------------------------
Resolution: Duplicate
> Checkpoint always hangs when running some jobs
> ----------------------------------------------
>
> Key: FLINK-28030
> URL: https://issues.apache.org/jira/browse/FLINK-28030
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.14.3
> Reporter: Pauli Gandhi
> Priority: Major
>
> We have noticed that a Flink job hangs and eventually times out after 2 hours,
> every time at the first checkpoint, after completing 15 of 23 acknowledgments
> (65%). There is no CPU activity, yet a number of tasks report
> 100% back pressure. The problem is peculiar to this job and slight variations
> of it. We have created many Flink jobs in the past and never encountered
> this issue.
> Here is what we tried in order to narrow down the problem:
> * The job runs fine if checkpointing is disabled.
> * Increasing the number of task managers and the parallelism to 2 seemed to
> help the job complete. However, it stalled again when we sent a larger data set.
> * Increasing task manager memory from 4 GB to 16 GB and CPU from 1 to 4
> did not help.
> * Restarting the job manager sometimes helps, but not always.
> * Breaking up the job into smaller parts helps the job to finish.
> * We analyzed the thread dump, and all threads appear to be in either a
> sleeping or a waiting state.
> Here are the environment details:
> * Flink version 1.14.3
> * Running on Kubernetes
> * Using the RocksDB state backend
> * Checkpoint storage is S3, using the Presto library
> * Exactly-once semantics with unaligned checkpoints enabled
> * Checkpoint timeout 2 hours
> * Maximum concurrent checkpoints is 1
> * Taskmanager CPU: 4, Slots: 1, Process Size: 12 GB
> * Using Kafka for input and output
> I have attached the task manager logs, a thread dump, and screenshots of the
> job graph and the stalled checkpoint.
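
For readers trying to reproduce the setup, the environment list above might
map onto Flink 1.14 configuration keys roughly as follows. This is a sketch
under assumptions, not the reporter's actual configuration: the bucket path is
hypothetical, the checkpoint interval is not stated in the report and is left
out, and kubernetes.taskmanager.cpu applies only if the native Kubernetes
deployment mode is used.

```yaml
# Sketch only: values are taken from the environment list in the report.
state.backend: rocksdb
# Hypothetical bucket/path; the s3p:// scheme selects the Presto-based S3 filesystem plugin.
state.checkpoints.dir: s3p://example-bucket/checkpoints
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.unaligned: true
execution.checkpointing.timeout: 2h
execution.checkpointing.max-concurrent-checkpoints: 1
# execution.checkpointing.interval is required to enable checkpointing,
# but the report does not give a value, so it is not filled in here.
taskmanager.numberOfTaskSlots: 1
taskmanager.memory.process.size: 12g
# Assumes native Kubernetes deployment.
kubernetes.taskmanager.cpu: 4
```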
--
This message was sent by Atlassian Jira
(v8.20.7#820007)