[ https://issues.apache.org/jira/browse/HDDS-1332?focusedWorklogId=223193&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-223193 ]
ASF GitHub Bot logged work on HDDS-1332:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 04/Apr/19 18:44
            Start Date: 04/Apr/19 18:44
    Worklog Time Spent: 10m
      Work Description: adoroszlai commented on pull request #697: [HDDS-1332] Attempt to fix flaky test testStartStopDatanodeStateMachine
URL: https://github.com/apache/hadoop/pull/697

## What changes were proposed in this pull request?

`testStartStopDatanodeStateMachine` is flaky, causing [occasional pre-commit build failures](https://builds.apache.org/job/hadoop-multibranch/job/PR-691/1/artifact/out/patch-unit-hadoop-hdds_container-service.txt). [HDDS-1332](https://issues.apache.org/jira/browse/HDDS-1332) added some logging to find out more about the cause.

I think the problem is not test-specific, and is caused by the following: `SCMConnectionManager#scmMachines` is a plain `HashMap`, guarded by a `ReadWriteLock` in most places where it's used, except `getValues()`. The method also returns the values collection without any write protection (though currently none of the callers modify it).

This is an attempt to fix the cause by acquiring the read lock and creating a read-only copy.

https://issues.apache.org/jira/browse/HDDS-1332

## How was this patch tested?

Ran affected unit tests several times, plus tried `ozone` docker-compose cluster.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 223193)
    Time Spent: 40m  (was: 0.5h)

> Add some logging for flaky test testStartStopDatanodeStateMachine
> -----------------------------------------------------------------
>
>                 Key: HDDS-1332
>                 URL: https://issues.apache.org/jira/browse/HDDS-1332
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> testStartStopDatanodeStateMachine fails frequently in Jenkins. It also seems
> to have a timing issue which may be different from the Jenkins failure.
> E.g. If I add a 10 second sleep as below I can get the test to fail 100%.
> {code}
> @@ -163,6 +163,7 @@ public void testStartStopDatanodeStateMachine() throws
> IOException,
>      try (DatanodeStateMachine stateMachine =
>          new DatanodeStateMachine(getNewDatanodeDetails(), conf, null)) {
>        stateMachine.startDaemon();
> +      Thread.sleep(10_000L);
> {code}
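
For readers unfamiliar with the pattern named in the pull request description (acquire the read lock, then hand out a read-only copy), here is a minimal sketch. It is not the actual patch: `ConnectionRegistry`, its `add` method, and the `String` key and value types are hypothetical stand-ins for `SCMConnectionManager` and its endpoint types, chosen only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for SCMConnectionManager; names and types are
// illustrative assumptions, not taken from the Hadoop source.
class ConnectionRegistry {
  private final ReadWriteLock mapLock = new ReentrantReadWriteLock();
  // Plain HashMap: every access must happen while holding mapLock.
  private final Map<String, String> scmMachines = new HashMap<>();

  // Writers take the write lock before mutating the map.
  void add(String address, String endpoint) {
    mapLock.writeLock().lock();
    try {
      scmMachines.put(address, endpoint);
    } finally {
      mapLock.writeLock().unlock();
    }
  }

  // Readers take the read lock and return an unmodifiable snapshot, so the
  // caller never iterates the live values() view while a writer changes it.
  Collection<String> getValues() {
    mapLock.readLock().lock();
    try {
      return Collections.unmodifiableList(
          new ArrayList<>(scmMachines.values()));
    } finally {
      mapLock.readLock().unlock();
    }
  }
}
```

The copy costs one allocation per call, but it decouples callers from later writes: a snapshot taken before an `add` does not change afterwards, and attempts to modify it throw `UnsupportedOperationException`.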