Benoit Sigoure created HDFS-8960:
------------------------------------
Summary: DFS client says "no more good datanodes being available
to try" on a single drive failure
Key: HDFS-8960
URL: https://issues.apache.org/jira/browse/HDFS-8960
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client
Affects Versions: 2.7.1
Environment: openjdk version "1.8.0_45-internal"
OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
Reporter: Benoit Sigoure
Since we upgraded to 2.7.1, we regularly see single-drive failures cause
widespread problems at the HBase level (with the default 3x replication target).
Here's an example. This HBase RegionServer is r12s16 (172.24.32.16) and is
writing its WAL to [172.24.32.16:10110, 172.24.32.8:10110, 172.24.32.13:10110],
as can be seen from occasional messages like this one:
{code}
2015-08-23 06:28:40,272 INFO [sync.3] wal.FSHLog: Slow sync cost: 123 ms,
current pipeline: [172.24.32.16:10110, 172.24.32.8:10110, 172.24.32.13:10110]
{code}
A bit later, the second node in that pipeline (172.24.32.8:10110) experiences an
HDD failure.
{code}
2015-08-23 07:21:58,720 WARN [DataStreamer for file
/hbase/WALs/r12s16.sjc.aristanetworks.com,9104,1439917659071/r12s16.sjc.aristanetworks.com%2C9104%2C1439917659071.default.1440314434998
block BP-1466258523-172.24.32.1-1437768622582:blk_1073817519_77099]
hdfs.DFSClient: Error Recovery for block
BP-1466258523-172.24.32.1-1437768622582:blk_1073817519_77099 in pipeline
172.24.32.16:10110, 172.24.32.13:10110, 172.24.32.8:10110: bad datanode
172.24.32.8:10110
{code}
And then HBase decides it can't write to its WAL anymore and commits suicide:
{code}
2015-08-23 07:22:26,060 FATAL
[regionserver/r12s16.sjc.aristanetworks.com/172.24.32.16:9104.append-pool1-t1]
wal.FSHLog: Could not append. Requesting close of wal
java.io.IOException: Failed to replace a bad datanode on the existing pipeline
due to no more good datanodes being available to try. (Nodes:
current=[172.24.32.16:10110, 172.24.32.13:10110], original=[172.24.32.16:10110,
172.24.32.13:10110]). The current failed datanode replacement policy is
DEFAULT, and a client may configure this via
'dfs.client.block.write.replace-datanode-on-failure.policy' in its
configuration.
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:969)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1035)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1184)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:933)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487)
{code}
This should be mostly a non-event: the DFS client should just drop the bad
replica from the write pipeline and carry on.
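For what it's worth, my understanding of the DEFAULT policy (paraphrased below
from the description in hdfs-default.xml as a sketch, with an illustrative class
name; this is NOT the actual {{ReplaceDatanodeOnFailure}} code) is that for an
hflushed WAL with 3x replication the client instead insists on adding a
replacement datanode, and only throws the exception above when it can't find
one:
{code}
// Sketch of my understanding of the DEFAULT policy, paraphrased from the
// hdfs-default.xml description; NOT the actual ReplaceDatanodeOnFailure code.
public class DefaultReplacePolicySketch {
  /**
   * @param r                replication factor of the block being written
   * @param n                number of datanodes still alive in the pipeline
   * @param appendOrHflushed whether the block is appended to or hflushed (a WAL is)
   */
  static boolean shouldAddReplacementDatanode(int r, int n, boolean appendOrHflushed) {
    if (r < 3) {
      return false; // low replication: just keep writing to the remaining nodes
    }
    // For our WAL: r = 3, one node fails, so n = 2 and the block is hflushed.
    // This returns true, i.e. the client insists on finding a replacement, and
    // when it can't find one it throws the IOException quoted above.
    return n <= r / 2 || (r > n && appendOrHflushed);
  }
}
{code}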
This is a small cluster, but it has 16 DNs, so the failed DN in the pipeline
should be easy to replace. I didn't set
{{dfs.client.block.write.replace-datanode-on-failure.policy}} (so it's still
{{DEFAULT}}) and didn't set
{{dfs.client.block.write.replace-datanode-on-failure.enable}} (so it's still
{{true}}).
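For reference, this is the client-side configuration we're effectively running
with, spelled out explicitly (a minimal sketch using the standard
{{org.apache.hadoop.conf.Configuration}} API; the class name is just for
illustration, we don't actually set these anywhere since they are the defaults):
{code}
import org.apache.hadoop.conf.Configuration;

// Minimal illustration only: these are the shipped defaults; we do not set them.
public class WalClientConfSketch {
  public static Configuration defaults() {
    Configuration conf = new Configuration();
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
    return conf;
  }
}
{code}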
I don't see anything noteworthy in the NN log around the time of the failure.
It just seems like the DFS client gave up, or threw an exception back to HBase
that it wasn't throwing before, and that made this single drive failure lethal.
We've occasionally been "unlucky" enough to have a single-drive failure cause
multiple RegionServers to commit suicide because they had their WALs on that
drive.
We upgraded from 2.7.0 about a month ago, and I'm not sure whether we were
seeing this with 2.7.0 or not. Prior to that we were running in quite a
different environment, but this is a fairly new deployment.