[ https://issues.apache.org/jira/browse/FLINK-10756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780667#comment-16780667 ]
Chesnay Schepler commented on FLINK-10756:
------------------------------------------

The second issue is due to us not actually verifying that the third TM has joined the cluster. We issue the start command for TM3 and shoot down TM1, but there's no guarantee that TM3 has already registered by the time the job fails. If TM3 hasn't registered yet, the cluster doesn't have enough slots and will fail the job right away. This is supported by the logs, since no task is ever deployed to TM3.

Not sure how to easily fix this since we don't have access to the REST API; I guess we could scan the TM output for a particular message.

As for the first issue, this seems quite different. The job is properly recovered, with all tasks being deployed. However, one of the combiner tasks on TM3 (1/4) never finishes.

> TaskManagerProcessFailureBatchRecoveryITCase did not finish on time
> -------------------------------------------------------------------
>
>                 Key: FLINK-10756
>                 URL: https://issues.apache.org/jira/browse/FLINK-10756
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Bowen Li
>            Assignee: Chesnay Schepler
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.8.0
>
>
> {code:java}
> Failed tests:
>
> TaskManagerProcessFailureBatchRecoveryITCase>AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure:207
> The program did not finish in time
> {code}
> https://travis-ci.org/apache/flink/jobs/449439623

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
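
A rough sketch of the "scan the TM output for a particular message" idea from the comment above, in case it helps the discussion. It assumes the test already captures TM3's output into some in-memory buffer and polls that buffer until a marker line shows up or a timeout expires. The class name `LogMarkerWaiter`, the method `awaitMarker`, and the marker string are all made up for illustration; the actual registration log line would have to be looked up in the TM logs.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical helper, not part of Flink: waits until captured process
// output contains a given marker string.
final class LogMarkerWaiter {

    /**
     * Polls the captured output until it contains the marker or the timeout
     * elapses. Returns true if the marker was seen, false on timeout or if
     * the waiting thread is interrupted.
     */
    static boolean awaitMarker(Supplier<String> capturedOutput, String marker, Duration timeout) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (capturedOutput.get().contains(marker)) {
                return true;
            }
            // Compare via subtraction so nanoTime overflow cannot break the check.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(50); // polling interval, chosen arbitrarily
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the buffer that collects TM3's stdout/stderr.
        StringBuffer tm3Output = new StringBuffer("Starting TaskManager\n");
        // Assumed marker text; replace with the real registration message.
        String marker = "TaskManager registered";

        // Simulates TM3 logging its registration a little later.
        Thread writer = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException ignored) {
                return;
            }
            tm3Output.append("TaskManager registered\n");
        });
        writer.start();

        boolean registered = awaitMarker(tm3Output::toString, marker, Duration.ofSeconds(5));
        System.out.println("TM3 registered: " + registered);
    }
}
```

In the test, a call like this would sit between starting TM3 and killing TM1, so that TM1 is only shot down once TM3 has reported registration. If the marker never appears, the test can fail with a clear message instead of timing out.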