[ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395805#comment-16395805
 ] 

Bharat Viswanadham commented on HDFS-13179:
-------------------------------------------

Hi [~gabor.bota]

If we do that, then the replica is not propagated to disk and the test caseĀ 
purpose of checking whether it is moved to disk or not is checked, if we add 
such a code right? Let me know If I am missing something here?

testDnRestartWithSavedReplicas will not be tested. So, I am thinking can we 
iterate for more than 30 seconds and check the test case behavior?

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-13179
>                 URL: https://issues.apache.org/jira/browse/HDFS-13179
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 3.0.0
>            Reporter: Gabor Bota
>            Priority: Critical
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
>     try {
>       ensureFileReplicasOnStorageType(path1, DEFAULT);
>     }catch (TimeoutException t){
>       LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>       ensureFileReplicasOnStorageType(path1, RAM_DISK);
>     }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =============================
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =============================
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to