[
https://issues.apache.org/jira/browse/MAPREDUCE-6507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133187#comment-15133187
]
Eric Badger commented on MAPREDUCE-6507:
----------------------------------------
Tests are failing because of a race condition between the RM startup and the NM
startup. In each of their serviceStart() methods, they are spawning new threads
to call start(), which introduces the race. The NM is set up with a waitCount
of up to 60 seconds, so that it can wait for the cluster to complete startup
(even though the start method for the RM has already returned). Removing the
threads fixes the race in the test that prompted this Jira (TestRMNMInfo), but
causes other tests to fail. Any test that starts a MiniYARNCluster without an
active RM will fail, because the node managers block the main thread from
transitioning one of the RMs from standby to active. This is why the threads
worked: they allowed the NMs to wait while the main thread went ahead and
transitioned a standby RM to active.
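To make the race concrete, here is a minimal sketch of the pattern described above. It is a simplified model, not the MiniYARNCluster code: the class RacyStartSketch, its methods, and the latch that stands in for slow NM registration are all hypothetical. It only shows that a start() which spawns threads can return while no node is live yet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the race; none of these names come from Hadoop.
class RacyStartSketch {
    private final AtomicInteger liveNodes = new AtomicInteger();
    // Assumption: the gate stands in for NM registration that has not finished yet.
    private final CountDownLatch gate = new CountDownLatch(1);
    private final List<Thread> workers = new ArrayList<>();

    // Spawns one thread per node and returns at once, like the threaded serviceStart().
    void start(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            Thread t = new Thread(() -> {
                try {
                    gate.await();                // registration is still in progress
                    liveNodes.incrementAndGet(); // the node becomes live only afterwards
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.setDaemon(true);
            workers.add(t);
            t.start();
        }
    }

    // Lets registration complete and waits for every node thread to finish.
    void finishRegistration() throws InterruptedException {
        gate.countDown();
        for (Thread t : workers) {
            t.join();
        }
    }

    int liveNodes() {
        return liveNodes.get();
    }

    public static void main(String[] args) throws InterruptedException {
        RacyStartSketch cluster = new RacyStartSketch();
        cluster.start(4);
        // A test asserting on live nodes here sees fewer than it expects.
        System.out.println("live nodes right after start(): " + cluster.liveNodes());
        cluster.finishRegistration();
        System.out.println("live nodes after registration: " + cluster.liveNodes());
    }
}
```

In the real cluster there is no gate, so the count observed right after start() depends on thread timing, which is why the failure is intermittent.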
I propose changing the MiniYARNCluster start method so that it does not
return until the cluster is fully started, and so that it always makes one RM
active in HA setups. This will require changes to the affected tests
(TestRMFailover, TestMiniYARNClusterForHA, etc.), but makes the code more
understandable and removes races. The tests are only passing right now because
excessive timeouts mask the race that they're fighting.
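A rough sketch of the ordering I have in mind, as a toy model rather than the attached patch: SequentialStartSketch and all of its members are hypothetical names, and the model throws where a real NM would block waiting for an active RM. The point is only that start() runs on the calling thread and activates an RM before any NM registers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the proposal; the names are illustrative, not Hadoop APIs.
class SequentialStartSketch {
    private final int rmCount;
    private final int nodeCount;
    private int activeRm = -1;                    // -1 means every RM is in standby
    private final List<String> liveNodes = new ArrayList<>();

    SequentialStartSketch(int rmCount, int nodeCount) {
        this.rmCount = rmCount;
        this.nodeCount = nodeCount;
    }

    // Runs entirely on the calling thread, so it returns only once startup is done.
    void start() {
        if (activeRm < 0) {
            transitionToActive(0);                // always make one RM active first
        }
        for (int i = 0; i < nodeCount; i++) {
            registerNode("nm-" + i);
        }
    }

    void transitionToActive(int rmIndex) {
        if (rmIndex < 0 || rmIndex >= rmCount) {
            throw new IllegalArgumentException("no such RM: " + rmIndex);
        }
        activeRm = rmIndex;
    }

    // Assumption: a real NM would wait for an active RM; this model fails fast instead.
    private void registerNode(String name) {
        if (activeRm < 0) {
            throw new IllegalStateException("no active RM for " + name);
        }
        liveNodes.add(name);
    }

    int activeRm() {
        return activeRm;
    }

    int liveNodeCount() {
        return liveNodes.size();
    }

    public static void main(String[] args) {
        SequentialStartSketch cluster = new SequentialStartSketch(2, 4);
        cluster.start();
        System.out.println("active RM index: " + cluster.activeRm());
        System.out.println("live nodes: " + cluster.liveNodeCount());
    }
}
```

With this ordering a test can assert on the live node count as soon as start() returns, without relying on a timeout.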
[~kasha] [~jlowe] Please advise.
> MiniYARNCluster.start() returns before cluster is completely started
> --------------------------------------------------------------------
>
> Key: MAPREDUCE-6507
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6507
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: test
> Reporter: Rohith Sharma K S
> Assignee: Eric Badger
> Attachments: MAPREDUCE-6507.001.patch
>
>
> TestRMNMInfo fails intermittently. Below is the trace for the failure:
> {noformat}
> testRMNMInfo(org.apache.hadoop.mapreduce.v2.TestRMNMInfo) Time elapsed: 0.28 sec <<< FAILURE!
> java.lang.AssertionError: Unexpected number of live nodes: expected:<4> but was:<3>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.apache.hadoop.mapreduce.v2.TestRMNMInfo.testRMNMInfo(TestRMNMInfo.java:111)
> {noformat}