[jira] [Commented] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

Ahmed Hussein (Jira) Mon, 27 Jan 2020 14:20:25 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024713#comment-17024713
 ]


Ahmed Hussein commented on HDFS-13179:
--------------------------------------

Thanks [~inigoiri]! This is a good catch. I am uploading a new patch with the 
changes.
FWIW, the entire LazyPersistTestCase.java have similar issues with waiting for 
very long periods of time. The test cases could be optimized by polling on a 
condition for short intervals. Also, {{triggerBlockReport}} only triggers a 
block report on node index 0. This leads to many problems in test cases that 
may expect triggering the block on all the nodes. This could explain why this 
test case was failing since restarting the DN will change the index; it never 
receives a request to trigger FBR.

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-13179
>                 URL: https://issues.apache.org/jira/browse/HDFS-13179
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 3.0.0
>            Reporter: Gabor Bota
>            Assignee: Ahmed Hussein
>            Priority: Critical
>         Attachments: HDFS-13179.001.patch, HDFS-13179.002.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
>     try {
>       ensureFileReplicasOnStorageType(path1, DEFAULT);
>     }catch (TimeoutException t){
>       LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>       ensureFileReplicasOnStorageType(path1, RAM_DISK);
>     }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =============================
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =============================
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

Reply via email to