[ 
https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212024#comment-13212024
 ] 

Aaron T. Myers commented on HDFS-2966:
--------------------------------------

Hey Steve, patch looks pretty good. I agree this issue could stand to be 
improved. I've also seen spurious failures in this test.

A few comments:

# In the spot where you call waitForGaugeValue for "FilesTotal", you also  
unnecessarily assert the value for FilesTotal.
# The name "waitForGaugeValue" seems a little misleading, since it's not a 
general-purpose method for gauges, but rather somewhat specific to gauges that 
are a function of _DN metrics_. Perhaps consider renaming it to something like 
"waitForDnMetricValue" ?
# Though the patch manages to get rid of the most race-prone sleeps (DN 
metrics), I don't think it will necessarily completely solve the issue for very 
slow VMs, since there are still several calls to updateMetrics. Can we 
completely remove the need for updateMetrics in this test, by waiting for a 
specific value as you've done here?
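
To illustrate the "wait for a specific value" idea from the last point, here is a rough sketch; it is not taken from the patch. The names MetricWait and waitForValue are hypothetical, as is the LongSupplier standing in for however the test reads the metric. It polls until the expected value shows up and fails with a timeout rather than relying on a fixed sleep.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Hadoop or of the attached patch.
final class MetricWait {

  private MetricWait() {}

  /**
   * Polls the gauge until it equals the expected value.
   * Throws TimeoutException if the value is not seen within timeoutMs.
   */
  static long waitForValue(LongSupplier gauge, long expected,
      long timeoutMs, long pollIntervalMs)
      throws InterruptedException, TimeoutException {
    final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      long actual = gauge.getAsLong();
      if (actual == expected) {
        return actual;
      }
      // Overflow-safe comparison against the deadline.
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("Expected " + expected
            + " but last saw " + actual + " after " + timeoutMs + " ms");
      }
      Thread.sleep(pollIntervalMs);
    }
  }
}
```

With something along these lines, a slow VM only makes the test take longer (up to the timeout) instead of making it fail, and a genuine failure reports the last value seen.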
                
> TestNameNodeMetrics tests can fail under load
> ---------------------------------------------
>
>                 Key: HDFS-2966
>                 URL: https://issues.apache.org/jira/browse/HDFS-2966
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.24.0
>         Environment: OS/X running intellij IDEA, firefox, winxp in a 
> virtualbox.
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>         Attachments: HDFS-2966.patch
>
>
> I've managed to recreate HDFS-540 and HDFS-2434 by the simple technique of 
> running the HDFS tests on a desktop without enough memory for all the 
> programs trying to run. Things got swapped out and the tests failed as the DN 
> heartbeats didn't come in on time.
> Both tests rely on {{waitForDeletion()}} to block until the 
> delete operation has completed, but all it does is sleep for the same number 
> of seconds as there are datanodes. This is too brittle: it may work on a 
> lightly-loaded system, but not on a system under heavy load where it is 
> taking longer to replicate than expected.
> Immediate fix: double, triple, the sleep time?
> Better fix: have the thread block until all the DN heartbeats have finished.


        
