[ 
https://issues.apache.org/jira/browse/FLINK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889630#comment-16889630
 ] 

Xintong Song commented on FLINK-13242:
--------------------------------------

Hi [~azagrebin], I think I found the problem.

In _StandaloneResourceManager#initialize()_ it uses _getMainThreadExecutor()_ 
to execute _setFailUnfulfillableRequest()_. However, before 
_setFailUnfulfillableRequest()_ is executed, the main thread executor of the 
resource manager might be replaced by a new one when it accepts the granted 
leadership, leading to _setFailUnfulfillableRequest()_ never being executed. 
This only happens when _StandaloneResourceManager#initialize()_ is invoked 
before _TestingLeaderElectionService#isLeader()_.
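
To illustrate the suspected race, here is a minimal sketch using a simplified, hypothetical model rather than Flink's actual classes. _FencedEndpointSketch_, the token field and the drop-on-stale-token behaviour are all assumptions made for illustration. They only mimic what the observations suggest: a task handed to an executor that was fetched before leadership is granted never runs.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of a fenced endpoint; not Flink's actual classes.
class FencedEndpointSketch {
    // Changes every time leadership is granted.
    private volatile long currentToken = 0L;
    // Bound to the token that was current when the executor was created.
    private volatile Executor mainThreadExecutor = newExecutor(0L);

    final AtomicBoolean failUnfulfillableRequest = new AtomicBoolean(false);

    // Assumption: an executor whose token is outdated silently drops its tasks.
    private Executor newExecutor(long token) {
        return task -> {
            if (token == currentToken) {
                task.run();
            }
        };
    }

    Executor getMainThreadExecutor() {
        return mainThreadExecutor;
    }

    // Accepting leadership replaces the main thread executor with a new one.
    void grantLeadership() {
        currentToken = currentToken + 1;
        mainThreadExecutor = newExecutor(currentToken);
    }

    // Failing order: the executor is fetched before leadership is granted.
    static boolean flagSetWhenFetchedBeforeLeadership() {
        FencedEndpointSketch rm = new FencedEndpointSketch();
        Executor stale = rm.getMainThreadExecutor();
        rm.grantLeadership();
        stale.execute(() -> rm.failUnfulfillableRequest.set(true));
        return rm.failUnfulfillableRequest.get();
    }

    // Passing order: the executor is fetched after leadership is granted.
    static boolean flagSetWhenFetchedAfterLeadership() {
        FencedEndpointSketch rm = new FencedEndpointSketch();
        rm.grantLeadership();
        Executor fresh = rm.getMainThreadExecutor();
        fresh.execute(() -> rm.failUnfulfillableRequest.set(true));
        return rm.failUnfulfillableRequest.get();
    }

    public static void main(String[] args) {
        System.out.println("fetched before leadership: "
                + flagSetWhenFetchedBeforeLeadership()); // false
        System.out.println("fetched after leadership: "
                + flagSetWhenFetchedAfterLeadership()); // true
    }
}
```

In this model the only difference between the two runs is whether the executor reference is obtained before or after _grantLeadership()_, which matches the ordering dependency described above.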

The problem can be reproduced and verified as follows:
 * Add logs in _StandaloneResourceManager#initialize()_, 
_TestingLeaderElectionService#isLeader()_ and 
_ResourceManager#setFailUnfulfillableRequest()_, and run the test. In most 
cases, you should see _TestingLeaderElectionService#isLeader()_ invoked before 
_StandaloneResourceManager#initialize()_, _setFailUnfulfillableRequest()_ 
invoked, and the test passing.
 * Add a short sleep (in my case 100ms) in 
_MockResourceManagerRuntimeServices#grantLeadership()_ before 
_rmLeaderElectionService.isLeader()_, and run the test again. Now you should 
see _TestingLeaderElectionService#isLeader()_ invoked after 
_StandaloneResourceManager#initialize()_, _setFailUnfulfillableRequest()_ 
never invoked, and the test failing.
 * Add another short sleep (also 100ms in my case) in 
_StandaloneResourceManager#initialize()_, inside the 
_getRpcService().getScheduledExecutor().schedule()_ block, right before 
_getMainThreadExecutor()_. This should restore the original invocation order 
and fix the failure.
 * If you invoke _getMainThreadExecutor()_ twice in 
_StandaloneResourceManager#initialize()_, once before the sleep and the other 
after it, and print out the fetched main thread executors, you should find that 
they are two different objects.
 * Now if you remove the sleep in _StandaloneResourceManager#initialize()_, you 
should see that the two printed main thread executors are the same object, and 
the test fails again.

I'm thinking that maybe _setFailUnfulfillableRequest(true)_ does not need to be 
invoked on the RPC main thread. Instead of going through the main thread 
executor, I tried calling _setFailUnfulfillableRequest(true)_ directly in the 
_getRpcService().getScheduledExecutor().schedule()_ block in 
_StandaloneResourceManager#initialize()_ and it fixes the problem.
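
A rough sketch of that idea, again with a hypothetical stand-in class rather than the real _StandaloneResourceManager_: the scheduled block flips the flag itself, so nothing depends on which main thread executor happens to be current. The class name, the _initialize()_ signature and the delay values are assumptions made for illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the resource manager; not Flink's actual class.
class DirectCallSketch {
    // volatile so the write from the scheduler thread is visible to readers;
    // an assumption of this sketch, not necessarily how the real field is declared.
    private volatile boolean failUnfulfillableRequest = false;

    void setFailUnfulfillableRequest(boolean value) {
        failUnfulfillableRequest = value;
    }

    boolean isFailingUnfulfillableRequest() {
        return failUnfulfillableRequest;
    }

    // Schedules the flag flip directly, with no hop through a main thread executor.
    ScheduledFuture<?> initialize(ScheduledExecutorService scheduler, long delayMillis) {
        return scheduler.schedule(
                () -> setFailUnfulfillableRequest(true), delayMillis, TimeUnit.MILLISECONDS);
    }

    // Runs the sketch once and reports whether the flag ended up set.
    static boolean runOnce() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            DirectCallSketch rm = new DirectCallSketch();
            ScheduledFuture<?> pending = rm.initialize(scheduler, 10);
            pending.get(5, TimeUnit.SECONDS); // wait for the scheduled block to finish
            return rm.isFailingUnfulfillableRequest();
        } catch (Exception e) {
            throw new IllegalStateException("scheduled block did not complete", e);
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("flag set: " + runOnce());
    }
}
```

The trade-off is that the setter now runs on the scheduler thread instead of the main thread, which is the consistency question discussed next.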

I think we do not care whether _setFailUnfulfillableRequest(true)_ happens on 
the main thread or not in production, as long as it eventually gets invoked. 
For this test case there may be a small inconsistency: after 
_setFailUnfulfillableRequest(true)_ is called, _isFailingUnfulfillableRequest()_ 
may not return the updated value immediately. I think that is acceptable, and 
the 10s timeout for _assertHappensUntil()_ should be long enough to eventually 
observe the call to _setFailUnfulfillableRequest(true)_. What do you think?
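
To show why a short delay in visibility should not matter for the test, here is a sketch of a polling assertion in the spirit of _assertHappensUntil()_. The signature, the polling interval and the class name are assumptions, not the test's actual implementation. Only the failure message is taken from the stack trace in the issue description.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical polling helper; the real test helper may look different.
class AssertHappensUntilSketch {
    // Polls the condition until it holds or the timeout elapses.
    static void assertHappensUntil(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError("condition was not fulfilled before the deadline");
            }
            try {
                Thread.sleep(2); // short pause between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", e);
            }
        }
    }

    // The flag is set from another thread after a delay; polling still observes it.
    static boolean demo() {
        AtomicBoolean flag = new AtomicBoolean(false);
        Thread setter = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            flag.set(true);
        });
        setter.start();
        assertHappensUntil(flag::get, 10_000);
        return flag.get();
    }

    public static void main(String[] args) {
        System.out.println("observed: " + demo());
    }
}
```

Because the helper keeps re-checking until the deadline, it tolerates the flag becoming visible some time after the setter was called, as long as that happens within the timeout.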

> StandaloneResourceManagerTest fails on travis
> ---------------------------------------------
>
>                 Key: FLINK-13242
>                 URL: https://issues.apache.org/jira/browse/FLINK-13242
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Chesnay Schepler
>            Assignee: Andrey Zagrebin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.9.0, 1.10.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://travis-ci.org/apache/flink/jobs/557696989
> {code}
> 08:28:06.475 [ERROR] 
> testStartupPeriod(org.apache.flink.runtime.resourcemanager.StandaloneResourceManagerTest)
>   Time elapsed: 10.276 s  <<< FAILURE!
> java.lang.AssertionError: condition was not fulfilled before the deadline
>       at 
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManagerTest.assertHappensUntil(StandaloneResourceManagerTest.java:114)
>       at 
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManagerTest.testStartupPeriod(StandaloneResourceManagerTest.java:60)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
