[ https://issues.apache.org/jira/browse/FLINK-13769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913209#comment-16913209 ]
Andrey Zagrebin commented on FLINK-13769:
-----------------------------------------
The problem is that since we merged waiting for all tasks to terminate in
TaskExecutor.onStop() (FLINK-11630), the testing mappers in
BatchFineGrainedRecoveryITCase are interrupted earlier, which causes
concurrency problems with the next slot allocation.
The JM is notified about the task failure sooner and quickly requests a slot
from the RM, which has not yet realised that the slot of the stopping TM can no
longer be used. To fix this, we need to deregister the TM from the RM at the
beginning of TaskExecutor.onStop().
Having deregistered itself, the TM should stop reconnecting to the RM. That
requires a preliminary change: checking the stopping state (FLINK-13819) of the
TM RpcEndpoint in TaskExecutor.disconnectResourceManager to decide whether to
reconnect.
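
To illustrate the intended ordering, here is a minimal toy model. It is only a
sketch and not the actual Flink code: ToyResourceManager, ToyTaskExecutor and
all of their methods are hypothetical stand-ins, and the callback passed to
onStop() merely simulates the JM asking the RM for a slot while the tasks of
the stopping TM are still terminating.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the stop ordering; none of these classes are Flink classes. */
final class StopOrderingSketch {

    /** Hypothetical stand-in for the RM: remembers which TMs it may hand out slots from. */
    static final class ToyResourceManager {
        private final Set<String> registered = new HashSet<>();

        void register(String tmId) {
            registered.add(tmId);
        }

        void deregister(String tmId) {
            registered.remove(tmId);
        }

        /** True if the RM still believes it can allocate a slot on the given TM. */
        boolean canAllocateSlotOn(String tmId) {
            return registered.contains(tmId);
        }
    }

    /** Hypothetical stand-in for the TM. */
    static final class ToyTaskExecutor {
        private final String id;
        private final ToyResourceManager rm;
        private boolean stopping;

        ToyTaskExecutor(String id, ToyResourceManager rm) {
            this.id = id;
            this.rm = rm;
        }

        void connect() {
            rm.register(id);
        }

        /** Drops the RM connection and reconnects only if this TM is not stopping. */
        void disconnectResourceManager() {
            rm.deregister(id);
            if (!stopping) {
                connect();
            }
        }

        /**
         * Stops the TM: mark it as stopping and deregister from the RM first,
         * and only then wait for the tasks to terminate. The callback stands in
         * for whatever happens concurrently during that wait.
         */
        void onStop(Runnable whileTasksTerminate) {
            stopping = true;
            disconnectResourceManager();
            whileTasksTerminate.run();
        }
    }
}
```

In this model a running TM that loses its RM connection registers again, while
a stopping TM stays deregistered, so a slot request arriving during task
termination no longer sees the slot of the stopping TM as available.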
> BatchFineGrainedRecoveryITCase.testProgram failed on Travis
> -----------------------------------------------------------
>
> Key: FLINK-13769
> URL: https://issues.apache.org/jira/browse/FLINK-13769
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.0
> Reporter: Andrey Zagrebin
> Assignee: Andrey Zagrebin
> Priority: Critical
> Labels: test-stability
>
> {{BatchFineGrainedRecoveryITCase.testProgram}} failed on Travis.
> {code}
> 23:14:26.860 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time
> elapsed: 50.007 s <<< FAILURE! - in
> org.apache.flink.test.recovery.BatchFineGrainedRecoveryITCase
> 23:14:26.868 [ERROR]
> testProgram(org.apache.flink.test.recovery.BatchFineGrainedRecoveryITCase)
> Time elapsed: 49.469 s <<< ERROR!
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> at
> org.apache.flink.test.recovery.BatchFineGrainedRecoveryITCase.testProgram(BatchFineGrainedRecoveryITCase.java:225)
> Caused by: java.util.concurrent.CompletionException:
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka.tcp://flink@localhost:39333/user/taskmanager_3#-344551647]] after
> [10000 ms]. Message of type
> [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason
> for `AskTimeoutException` is that the recipient actor didn't send a reply.
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka.tcp://flink@localhost:39333/user/taskmanager_3#-344551647]] after
> [10000 ms]. Message of type
> [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason
> for `AskTimeoutException` is that the recipient actor didn't send a reply.
> {code}
> [https://travis-ci.org/apache/flink/jobs/573523669]