[ https://issues.apache.org/jira/browse/FLINK-28032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pauli Gandhi updated FLINK-28032:
---------------------------------
    Description: 
We have noticed that a Flink job hangs and eventually times out after 2 hours,
every time at the first checkpoint, after it completes 15/23 (65%)
acknowledgments.  There is no CPU or record-processing activity, yet a number
of tasks report 100% back pressure.  The issue is specific to this job and to
slight variations of it.  We have created many Flink jobs in the past and
never encountered this issue.

Here are the things we tried to narrow down the problem:
 * The job runs fine if checkpointing is disabled.
 * Increasing the number of task managers and the parallelism to 2 seemed to
help the job complete.  However, the job stalled again when we sent a larger
data set.
 * Increasing task manager memory from 4 GB to 16 GB and CPU from 1 to 4 did
not help.
 * Restarting the job manager sometimes helps, but not always.
 * Breaking up the job into smaller parts allows it to finish.
 * Analyzing the thread dump showed that all threads are in either a sleeping
or a waiting state.

I have attached the task manager logs (including debug logs for
checkpointing), the thread dump, and screenshots of the job graph and the
stalled checkpoint.
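
For reference, the sketch below reconstructs the checkpoint configuration from
the Environment details further down, using the standard Flink 1.14 DataStream
API.  This is not the actual job code: the class name, checkpoint interval,
and S3 bucket path are illustrative placeholders; the remaining settings
mirror the ones listed under Environment.

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSetupSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // RocksDB state backend, as listed under Environment.
            env.setStateBackend(new EmbeddedRocksDBStateBackend());

            // Exactly-once checkpoints; the 60 s interval is a placeholder.
            env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

            CheckpointConfig cfg = env.getCheckpointConfig();
            cfg.setCheckpointTimeout(2L * 60 * 60 * 1000); // 2-hour checkpoint timeout
            cfg.setMaxConcurrentCheckpoints(1);            // at most 1 concurrent checkpoint
            cfg.enableUnalignedCheckpoints();              // unaligned checkpoints enabled

            // Checkpoint storage on S3 via the Presto library ("s3p://" scheme);
            // the bucket path is a placeholder.
            cfg.setCheckpointStorage("s3p://my-bucket/checkpoints");
        }
    }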

Your help in resolving this issue is greatly appreciated.


> Flink checkpointing hangs and times out with some jobs
> ------------------------------------------------------
>
>                 Key: FLINK-28032
>                 URL: https://issues.apache.org/jira/browse/FLINK-28032
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.3
>         Environment: Here are the environment details
>  * Flink version 1.14.3
>  * Running on Kubernetes
>  * Using RocksDB state backend.
>  * Checkpoint storage is S3 storage using the Presto library
>  * Exactly-once semantics with unaligned checkpoints enabled.
>  * Checkpoint timeout 2 hours
>  * Maximum concurrent checkpoints is 1
>  * Taskmanager CPU: 4, Slots: 1, Process Size: 12 GB
>  * Using Kafka for input and output
>            Reporter: Pauli Gandhi
>            Priority: Major
>         Attachments: checkpoint snapshot.png, jobgraph.png, 
> taskmanager_10.112.55.143_6122-969889_log, 
> taskmanager_10.112.55.143_6122-969889_thread_dump


