[
https://issues.apache.org/jira/browse/FLINK-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220750#comment-16220750
]
ASF GitHub Bot commented on FLINK-7844:
---------------------------------------
Github user StephanEwen commented on the issue:
https://github.com/apache/flink/pull/4844
Had an offline discussion with @tillrohrmann - rewriting this without
Mockito results in a similar amount of code with similar maintenance effort, so
it seems to be okay in this case.
+1 to merge after fixing the `Thread.holdsLock(lock)` comment above
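
For readers without the PR open: `Thread.holdsLock(lock)` is the usual lightweight
way to assert that a method is only invoked while the caller already holds the shared
lock. A minimal sketch of that pattern (class and method names are illustrative, not
the actual code from #4844):

```java
public class CheckpointBookkeeping {

    private final Object lock = new Object();
    private long latestCompletedCheckpointId = -1L;

    /** Must only be called while 'lock' is held by the current thread. */
    private void updateLatestCheckpoint(long checkpointId) {
        // Precondition check: fail fast if a caller forgot to synchronize.
        assert Thread.holdsLock(lock);
        latestCompletedCheckpointId = Math.max(latestCompletedCheckpointId, checkpointId);
    }

    public void checkpointCompleted(long checkpointId) {
        synchronized (lock) {
            updateLatestCheckpoint(checkpointId);
        }
    }
}
```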
> Fine Grained Recovery triggers checkpoint timeout failure
> ---------------------------------------------------------
>
> Key: FLINK-7844
> URL: https://issues.apache.org/jira/browse/FLINK-7844
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Affects Versions: 1.4.0, 1.3.2
> Reporter: Zhenzhong Xu
> Assignee: Zhenzhong Xu
> Attachments: screenshot-1.png
>
>
> Context:
> We are using the "individual" failover (fine-grained) recovery strategy for our
> embarrassingly parallel router use case. The topic has over 2000 partitions,
> and parallelism is set to ~180, dispatched across over 20 task managers with
> around 180 slots.
> Observations:
> We've noticed that after one task manager was terminated, the individual
> recovery happened correctly and the workload was re-dispatched to a newly
> available task manager instance. However, the checkpoint would take 10 minutes
> to eventually time out, preventing all other task managers from committing
> checkpoints. In the worst case, if the job got restarted for other reasons
> (e.g. job manager termination), that would cause more messages to be
> re-processed/duplicated compared to a job without fine-grained recovery
> enabled.
> I suspect that the uber checkpoint was waiting for a previous checkpoint
> initiated by the old task manager and therefore took a long time to time
> out.
> Two questions:
> 1. Is there a configuration that controls this checkpoint timeout?
> 2. Is there any reason why, when the Job Manager realizes that the Task Manager
> is gone and the workload has been re-dispatched, it still needs to wait for the
> checkpoint initiated by the old task manager?
> Checkpoint screenshot in attachments.
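
On question 1 above: the checkpoint timeout is configurable per job, and its default
of 10 minutes matches the observed delay. A minimal sketch, assuming the standard
DataStream API (`CheckpointConfig#setCheckpointTimeout`); the interval and timeout
values are illustrative only, and the `jobmanager.execution.failover-strategy` entry
in the comment reflects how fine-grained recovery is typically enabled cluster-side:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Fine-grained recovery itself is enabled in flink-conf.yaml, e.g.:
        //   jobmanager.execution.failover-strategy: individual

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds (illustrative value).
        env.enableCheckpointing(60_000L);

        // Lower the checkpoint timeout from the 10-minute default to 2 minutes,
        // so checkpoints pending on a lost task manager are abandoned sooner.
        env.getCheckpointConfig().setCheckpointTimeout(2 * 60 * 1000L);

        // ... define sources, operators, and sinks, then call env.execute(...) ...
    }
}
```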
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)