[ https://issues.apache.org/jira/browse/FLINK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889630#comment-16889630 ]
Xintong Song commented on FLINK-13242:
--------------------------------------

Hi [~azagrebin],

I think I found the problem.

_StandaloneResourceManager#initialize()_ uses _getMainThreadExecutor()_ to execute _setFailUnfulfillableRequest()_. However, before _setFailUnfulfillableRequest()_ is executed, the main thread executor of the resource manager might be replaced by a new one when the resource manager accepts the granted leadership, so _setFailUnfulfillableRequest()_ is never executed. This only happens when _StandaloneResourceManager#initialize()_ is invoked before _TestingLeaderElectionService#isLeader()_.

The problem can be reproduced and verified as follows:
* Add logs in _StandaloneResourceManager#initialize()_, _TestingLeaderElectionService#isLeader()_ and _ResourceManager#setFailUnfulfillableRequest()_, and run the test. In most cases, you should see _TestingLeaderElectionService#isLeader()_ invoked before _StandaloneResourceManager#initialize()_, _setFailUnfulfillableRequest()_ is invoked, and the test should pass.
* Add a short sleep (100ms in my case) in _MockResourceManagerRuntimeServices#grantLeadership()_ before _rmLeaderElectionService.isLeader()_, and run the test again. Now you should see _TestingLeaderElectionService#isLeader()_ invoked after _StandaloneResourceManager#initialize()_, _setFailUnfulfillableRequest()_ is never invoked, and the test should fail.
* Add another short sleep (also 100ms in my case) in _StandaloneResourceManager#initialize()_, inside the _getRpcService().getScheduledExecutor().schedule()_ block, right before _getMainThreadExecutor()_. This should restore the original invocation order and fix the failure.
* If you invoke _getMainThreadExecutor()_ twice in _StandaloneResourceManager#initialize()_, once before the sleep and once after it, and print out the fetched main thread executors, you should find that they are two different objects.
* Now if you remove the sleep in _StandaloneResourceManager#initialize()_, you should see that the two printed main thread executors are the same object, and the test fails again.

I'm thinking that maybe _setFailUnfulfillableRequest(true)_ does not need to be invoked on the RPC main thread. Instead of calling it on the main thread executor, I tried calling _setFailUnfulfillableRequest(true)_ directly in the _getRpcService().getScheduledExecutor().schedule()_ block in _StandaloneResourceManager#initialize()_, and that fixes the problem.

I think we do not care whether _setFailUnfulfillableRequest(true)_ happens on the main thread or not in production, as long as it eventually gets invoked. For this test case, there may be a slight inconsistency: after _setFailUnfulfillableRequest(true)_, _isFailingUnfulfillableRequest()_ may not return the correct result immediately. I think that is acceptable, and the 10s timeout for _assertHappensUntil()_ should be long enough to eventually catch the invocation of _setFailUnfulfillableRequest(true)_.

What do you think?

> StandaloneResourceManagerTest fails on travis
> ---------------------------------------------
>
> Key: FLINK-13242
> URL: https://issues.apache.org/jira/browse/FLINK-13242
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.0
> Reporter: Chesnay Schepler
> Assignee: Andrey Zagrebin
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.9.0, 1.10.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> https://travis-ci.org/apache/flink/jobs/557696989
> {code}
> 08:28:06.475 [ERROR] testStartupPeriod(org.apache.flink.runtime.resourcemanager.StandaloneResourceManagerTest) Time elapsed: 10.276 s <<< FAILURE!
> java.lang.AssertionError: condition was not fulfilled before the deadline
> 	at org.apache.flink.runtime.resourcemanager.StandaloneResourceManagerTest.assertHappensUntil(StandaloneResourceManagerTest.java:114)
> 	at org.apache.flink.runtime.resourcemanager.StandaloneResourceManagerTest.testStartupPeriod(StandaloneResourceManagerTest.java:60)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
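
For illustration, the race described in the comment can be sketched as a small deterministic toy model. This is not Flink's code: _ToyEndpoint_, its token counter, the mailbox and the _scenario()_ helper are all hypothetical names, and the assumption that a task submitted through a replaced executor is silently dropped is a simplification of the behaviour described in the comment. The sketch only shows why the order of "grant leadership" and "submit via the main thread executor" matters, and why setting the flag directly sidesteps the issue.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

public class FencedExecutorSketch {

    /** Hypothetical stand-in for an RPC endpoint; NOT Flink's ResourceManager. */
    static final class ToyEndpoint {
        private final Queue<Runnable> mailbox = new ArrayDeque<>();
        private int currentToken = 0;
        private Executor mainThreadExecutor = newExecutor(0);
        final AtomicBoolean failUnfulfillableRequest = new AtomicBoolean(false);

        /** Each executor remembers the token it was created with. */
        private Executor newExecutor(int token) {
            return task -> mailbox.add(() -> {
                // Assumption: tasks from an outdated executor are dropped.
                if (token == currentToken) {
                    task.run();
                }
            });
        }

        Executor getMainThreadExecutor() {
            return mainThreadExecutor;
        }

        /** Accepting leadership replaces the main thread executor. */
        void grantLeadership() {
            currentToken++;
            mainThreadExecutor = newExecutor(currentToken);
        }

        /** Runs everything queued so far, in submission order. */
        void drainMailbox() {
            Runnable next;
            while ((next = mailbox.poll()) != null) {
                next.run();
            }
        }
    }

    /**
     * Returns whether the flag ends up set.
     *
     * @param leadershipGrantedFirst whether leadership is granted before the flag change is submitted
     * @param viaMainThreadExecutor  whether the flag is set through the executor or directly
     */
    public static boolean scenario(boolean leadershipGrantedFirst, boolean viaMainThreadExecutor) {
        ToyEndpoint rm = new ToyEndpoint();
        if (leadershipGrantedFirst) {
            rm.grantLeadership();
        }
        if (viaMainThreadExecutor) {
            rm.getMainThreadExecutor().execute(() -> rm.failUnfulfillableRequest.set(true));
        } else {
            rm.failUnfulfillableRequest.set(true);
        }
        if (!leadershipGrantedFirst) {
            rm.grantLeadership();
        }
        rm.drainMailbox();
        return rm.failUnfulfillableRequest.get();
    }

    public static void main(String[] args) {
        System.out.println("leadership first, via executor: " + scenario(true, true));
        System.out.println("submit first, via executor:     " + scenario(false, true));
        System.out.println("submit first, set directly:     " + scenario(false, false));
    }
}
```

In this model, only the middle case (task submitted through the old executor, leadership granted afterwards) leaves the flag unset, which mirrors the failing order reported in the comment.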