[ 
https://issues.apache.org/jira/browse/FLINK-10756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780667#comment-16780667
 ] 

Chesnay Schepler commented on FLINK-10756:
------------------------------------------

The second issue is caused by the test never verifying that the third TM has 
joined the cluster. We issue the start command for TM3 and then kill TM1, but 
there's no guarantee that TM3 has registered by the time the job fails. If TM3 
hasn't registered yet, the cluster doesn't have enough slots and the job fails 
right away.

This is supported by the logs, since no task is ever deployed to TM3. I'm not 
sure how to fix this easily, since the test doesn't have access to the REST 
API; I guess we could scan the TM output for a particular message.
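
A rough sketch of what scanning the TM output could look like, assuming the 
test already captures the TM process output into some buffer it can read as a 
String. The class name, the method and its parameters are made up for 
illustration and are not existing Flink test utilities; the marker string to 
wait for is also left open, since the exact registration log message would 
have to be looked up first.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper, not part of Flink: waits for a marker in captured process output. */
final class TaskManagerOutputWaiter {

    private TaskManagerOutputWaiter() {}

    /**
     * Polls the captured output until it contains the marker or the timeout expires.
     *
     * @param output supplies the output captured so far (e.g. backed by a StringWriter)
     * @param marker substring to wait for; the real log message is an assumption of the caller
     * @return true if the marker was seen before the timeout, false otherwise
     */
    static boolean waitForMarker(
            Supplier<String> output, String marker, long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (output.get().contains(marker)) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The test would then call this for TM3 after starting it and only kill TM1 
once it returns true, failing with a clear message otherwise.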

As for the first issue, this seems quite different. The job is properly 
recovered, with all tasks being deployed. However, one of the combiner tasks 
on TM3 (1/4) never finishes.

> TaskManagerProcessFailureBatchRecoveryITCase did not finish on time
> -------------------------------------------------------------------
>
>                 Key: FLINK-10756
>                 URL: https://issues.apache.org/jira/browse/FLINK-10756
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Bowen Li
>            Assignee: Chesnay Schepler
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.8.0
>
>
> {code:java}
> Failed tests:
>   TaskManagerProcessFailureBatchRecoveryITCase>AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure:207 The program did not finish in time
> {code}
> https://travis-ci.org/apache/flink/jobs/449439623



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
