[
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395838#comment-16395838
]
Gabor Bota commented on HDFS-13179:
-----------------------------------
[~bharatviswa] if you are speaking about increasing the timeout, it won't fix
the issue. Once the {noformat} addStoredBlock: Redundant addStoredBlock request
received {noformat} happens the test will fail.
Do you maybe know what can cause the redundant request, and how to prevent it?
> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails
> intermittently
> ----------------------------------------------------------------------------------
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: fs
> Affects Versions: 3.0.0
> Reporter: Gabor Bota
> Priority: Critical
>
> The error caused by TimeoutException because the test is waiting to ensure
> that the file is replicated to DISK storage but the replication can't be
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(),
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially
> fixes the flakiness.
> {code:java}
> try {
> ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
> LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on
> RAM_DISK");
> ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
> }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode
> restarts. Restart line is the following in the log: LazyPersistTestCase -
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock:
> Redundant addStoredBlock request received for blk_1073741825_1001 on node
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main
> reason for the test fail. Something wrong with the gen stamp? Corrupt
> replicas?
> =============================
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery:
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition.
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6
> on 39589"
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729]
> java.net.BindException: Address already in use; For more details see:
> http://wiki.apache.org/hadoop/BindException at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> Caused by: java.net.BindException: Address already in use at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =============================
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]