[
https://issues.apache.org/jira/browse/FLINK-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389601#comment-16389601
]
ASF GitHub Bot commented on FLINK-8887:
---------------------------------------
GitHub user zentol opened a pull request:
https://github.com/apache/flink/pull/5657
[FLINK-8887][tests] Add single retry in MiniClusterClient
## What is the purpose of the change
This PR adds a test workaround for race conditions in FLIP-6 (most
notably FLINK-8887). Every `MiniClusterClient` call is retried
*once* after 500ms if it fails with certain exceptions.
**This is only a band-aid until a proper fix is in place** so we can
finally continue merging more test ports.
## Brief change log
* add `guardWithSingleRetry` convenience method
* add `ScheduledExecutor` to `MiniClusterClient`
* guard all calls to the `MiniCluster`
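
For illustration only, a minimal sketch of what a single-retry guard along these lines could look like. This is not the code from the PR: the class name `SingleRetry`, the method signature, and the `isRetryable` predicate are assumptions, and it uses the JDK's `ScheduledExecutorService` as a stand-in for Flink's `ScheduledExecutor`. Which exceptions count as retryable is left to the caller, since the description only says "certain exceptions".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical helper; names and signature are assumptions, not the PR's code. */
public final class SingleRetry {

    private SingleRetry() {}

    /**
     * Runs the operation. If it fails with an error accepted by isRetryable,
     * runs it exactly one more time after delayMillis; otherwise the first
     * failure is propagated unchanged.
     */
    public static <T> CompletableFuture<T> guardWithSingleRetry(
            Supplier<CompletableFuture<T>> operation,
            ScheduledExecutorService executor,
            Predicate<Throwable> isRetryable,
            long delayMillis) {

        CompletableFuture<T> result = new CompletableFuture<>();

        attempt(operation).whenComplete((value, failure) -> {
            if (failure == null) {
                result.complete(value);
            } else if (isRetryable.test(unwrap(failure))) {
                // Second and last attempt; its outcome is final either way.
                executor.schedule(() -> {
                    attempt(operation).whenComplete((retryValue, retryFailure) -> {
                        if (retryFailure == null) {
                            result.complete(retryValue);
                        } else {
                            result.completeExceptionally(unwrap(retryFailure));
                        }
                    });
                }, delayMillis, TimeUnit.MILLISECONDS);
            } else {
                result.completeExceptionally(unwrap(failure));
            }
        });

        return result;
    }

    /** Converts a synchronous throw from the supplier into a failed future. */
    private static <T> CompletableFuture<T> attempt(Supplier<CompletableFuture<T>> operation) {
        try {
            return operation.get();
        } catch (Throwable t) {
            CompletableFuture<T> failed = new CompletableFuture<>();
            failed.completeExceptionally(t);
            return failed;
        }
    }

    /** Strips the CompletionException wrapper that CompletableFuture may add. */
    private static Throwable unwrap(Throwable t) {
        return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
    }
}
```

Under these assumptions, a client call would be wrapped as `guardWithSingleRetry(() -> someAsyncCall(), executor, isRetryable, 500)`, where `someAsyncCall` is a placeholder for any call that returns a `CompletableFuture`. Capping the guard at one retry keeps the workaround from masking a persistent failure.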
## Verifying this change
The change can be verified by cherry-picking [this
branch](https://github.com/zentol/flink/tree/8797) and running the
`AbstractOperatorRestoreTestBase`. Before this change there were always 1-2
tests failing, whereas now none should fail.
/cc @aljoscha @GJL
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zentol/flink 8887_bandaid
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/5657.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5657
----
commit f685fad6731b7c1774b247985a446260ea285663
Author: zentol <chesnay@...>
Date: 2018-03-07T14:02:05Z
[FLINK-8887][tests] Add single retry in MiniClusterClient
----
> ClusterClient.getJobStatus can throw FencingTokenException
> ----------------------------------------------------------
>
> Key: FLINK-8887
> URL: https://issues.apache.org/jira/browse/FLINK-8887
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.5.0
> Reporter: Gary Yao
> Assignee: vinoyang
> Priority: Blocker
> Labels: flip-6
> Fix For: 1.5.0
>
>
> *Description*
> Calling {{RestClusterClient.getJobStatus}} or
> {{MiniClusterClient.getJobStatus}} can result in a {{FencingTokenException}}.
> *Analysis*
> {{Dispatcher.requestJobStatus}} first looks the {{JobManagerRunner}} up by
> job id. If a reference is found, {{requestJobStatus}} is called on the
> respective instance. If not, the {{ArchivedExecutionGraphStore}} is queried.
> However, between the lookup and the method call, the {{JobMaster}} of the
> respective job may already have lost leadership (because the job finished)
> and set the fencing token to {{null}}.
> *Stacktrace*
> {noformat}
> Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException:
> Fencing token mismatch: Ignoring message LocalFencedMessage(null,
> LocalRpcInvocation(requestJobStatus(Time))) because the fencing token null
> did not match the expected fencing token b8423c75bc6838244b8c93c8bd4a4f51.
> at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleMessage(FencedAkkaRpcActor.java:73)
> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
> at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}
> {noformat}
> Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException:
> Fencing token not set: Ignoring message LocalFencedMessage(null,
> LocalRpcInvocation(requestJobStatus(Time))) because the fencing token is null.
> at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleMessage(FencedAkkaRpcActor.java:56)
> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
> at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)