[
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395888#comment-16395888
]
Íñigo Goiri commented on HDFS-13179:
------------------------------------
FWIW, not completely related but TestLazyPersistReplicaRecovery cannot even
start in Windows:
{code}
[ERROR]
testDnRestartWithSavedReplicas(org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery)
Time elapsed: 2.804 s <<< ERROR!
1450: Insufficient system resources exist to complete the requested service.
at
org.apache.hadoop.io.nativeio.NativeIO$Windows.extendWorkingSetSize(Native
Method)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1373)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:497)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2769)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2677)
at
org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:1643)
at
org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:885)
at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:497)
at
org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:456)
at
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.startUpCluster(LazyPersistTestCase.java:315)
at
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase$ClusterWithRamDiskBuilder.build(LazyPersistTestCase.java:414)
at
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{code}
Probably this should be a separate JIRA though.
> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails
> intermittently
> ----------------------------------------------------------------------------------
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: fs
> Affects Versions: 3.0.0
> Reporter: Gabor Bota
> Priority: Critical
>
> The error caused by TimeoutException because the test is waiting to ensure
> that the file is replicated to DISK storage but the replication can't be
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(),
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially
> fixes the flakiness.
> {code:java}
> try {
> ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
> LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on
> RAM_DISK");
> ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
> }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode
> restarts. Restart line is the following in the log: LazyPersistTestCase -
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock:
> Redundant addStoredBlock request received for blk_1073741825_1001 on node
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main
> reason for the test fail. Something wrong with the gen stamp? Corrupt
> replicas?
> =============================
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery:
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition.
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6
> on 39589"
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729]
> java.net.BindException: Address already in use; For more details see:
> http://wiki.apache.org/hadoop/BindException at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> Caused by: java.net.BindException: Address already in use at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =============================
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]