[
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119520#comment-17119520
]
Stephen O'Donnell commented on HDFS-13179:
------------------------------------------
branch-3.0 is failing to compile, I think after this change was committed:
{code}
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-compiler-plugin:3.1:testCompile
(default-testCompile) on project hadoop-hdfs: Compilation failure
[ERROR]
/Users/sodonnell/source/upstream_hadoop/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestLazyPersistReplicaRecovery.java:[102,60]
incompatible types:
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor cannot be
converted to java.lang.Throwable
{code}
I think its due to a SLF4J style log message:
{code}
...
} catch (TimeoutException te) {
LOG.error("Timeout waiting for block report for {}", dnd);
return false;
}
{code}
I tested reverting the change locally and it fixed the compile problem. Should
we revert this change on branch-3.0 and push a new patch to fix the log message?
> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails
> intermittently
> ----------------------------------------------------------------------------------
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: fs
> Affects Versions: 3.0.0
> Reporter: Gabor Bota
> Assignee: Ahmed Hussein
> Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-13179-branch-2.10.003.patch, HDFS-13179.001.patch,
> HDFS-13179.002.patch, HDFS-13179.003.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure
> that the file is replicated to DISK storage but the replication can't be
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(),
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially
> fixes the flakiness.
> {code:java}
> try {
> ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
> LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on
> RAM_DISK");
> ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
> }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode
> restarts. Restart line is the following in the log: LazyPersistTestCase -
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock:
> Redundant addStoredBlock request received for blk_1073741825_1001 on node
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main
> reason for the test fail. Something wrong with the gen stamp? Corrupt
> replicas?
> =============================
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery:
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition.
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6
> on 39589"
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729]
> java.net.BindException: Address already in use; For more details see:
> http://wiki.apache.org/hadoop/BindException at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> Caused by: java.net.BindException: Address already in use at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =============================
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]