nkeywal created HBASE-6626:
------------------------------

             Summary: Add a chapter on HDFS in the troubleshooting section of 
the HBase reference guide.
                 Key: HBASE-6626
                 URL: https://issues.apache.org/jira/browse/HBASE-6626
             Project: HBase
          Issue Type: Improvement
          Components: documentation
    Affects Versions: 0.96.0
            Reporter: nkeywal
            Priority: Minor


I looked mainly at the major failure case, but here is what I have:

New sub-chapter in the existing chapter "Troubleshooting and Debugging HBase": 
"HDFS & HBase"

1) HDFS & HBase
2) Settings for retries and timeouts
2.1) Retries
2.2) Timeouts
3) Typical error logs


1) HDFS & HBase
HBase uses HDFS to store its HFiles (the core HBase data files) and its 
Write-Ahead Logs (the files used to restore the data after a crash).
In both cases, the reliability of HBase comes from the fact that HDFS writes 
the data to multiple locations. To be efficient, HBase needs the data to be 
available locally, hence it is highly recommended to run an HDFS datanode on 
the same machines as the HBase Region Servers.

Detailed information on how HDFS works can be found at [1].

The important points are:
 - HBase is a client application of HDFS, i.e. it uses the HDFS DFSClient 
class. This class can appear in HBase logs along with other HDFS-client-related 
messages (see the sketch after this list).
 - Some HDFS settings are HDFS-server-side, i.e. must be set on the HDFS side, 
some are HDFS-client-side, i.e. must be set in HBase, and some must be set in 
both places.
 - HDFS writes are pipelined from one datanode to another. When writing, there 
are communications between:
    - HBase and the HDFS namenode, through the HDFS client classes.
    - HBase and the HDFS datanodes, through the HDFS client classes.
    - the HDFS datanodes among themselves: issues in these communications show 
up in the HDFS logs, not in the HBase logs. HDFS writes are always local when 
possible. As a consequence, there should not be many write errors in the HBase 
Region Servers: they write to the local datanode. If this datanode cannot 
replicate the blocks, that will appear in its own logs, not in the region 
server logs.
 - Datanodes can be contacted through the ipc.Client interface (once again this 
class can show up in HBase logs) and through the data transfer interface 
(usually showing up as the DataNode class in the HBase logs). These use 
different ports (defaults: 50010 for data transfer and 50020 for ipc).
 - To understand exactly what's going on, you must look at the HDFS log files 
as well: the HBase logs show only the client side.
 - With the default settings, HDFS needs 630s to mark a datanode as dead. Until 
then, this node will still be tried by HBase and by the other datanodes when 
reading and writing, which adds some extra lines to the logs. This monitoring 
is performed by the NameNode.
 - The HDFS clients (i.e. HBase using the HDFS client code) do not fully rely 
on the NameNode: they can temporarily mark a node as dead themselves if they 
got an error when they tried to use it.
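
As announced in the first point above, here is a minimal, hypothetical Java 
sketch (not HBase code) showing that HBase reaches HDFS through the generic 
FileSystem API, backed by DFSClient when the default filesystem is an hdfs:// 
URI, and that the HDFS-client-side settings are the ones found on HBase's 
classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Hypothetical check, assuming the HBase conf directory (hbase-site.xml and
// usually a copy of or link to hdfs-site.xml) is on the classpath.
public class WhoIsMyFileSystem {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // With fs.default.name pointing to an hdfs:// URI, this prints
    // org.apache.hadoop.hdfs.DistributedFileSystem, which wraps DFSClient.
    System.out.println("FileSystem implementation: " + fs.getClass().getName());
    // The HDFS-client-side value HBase will actually use:
    System.out.println("dfs.socket.timeout = " + conf.get("dfs.socket.timeout"));
  }
}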

2) Settings for retries and timeouts
2.1) Retries
ipc.client.connect.max.retries
Default 10
Indicates the number of retries a client will make to establish a server 
connection. Not taken into account if the error is a SocketTimeout; in that 
case the number of retries is 45 (changed by HADOOP-7932 or HADOOP-7397, 
depending on the branch). For SASL, the number of retries is hard-coded to 15. 
Can be increased, especially if the socket timeouts have been lowered.

ipc.client.connect.max.retries.on.timeouts
Default 45
If you have HADOOP-7932, this is the maximum number of retries on timeouts. 
The counter is separate from ipc.client.connect.max.retries, so if socket 
errors and timeouts are mixed you can get up to 55 retries with the default 
values. Could be lowered, once it is available. With HADOOP-7397, 
ipc.client.connect.max.retries is reused instead, so there would be 10 tries.

dfs.client.block.write.retries
Default 3
Number of tries for the client when writing a block. After a failure, it will 
go back to the namenode to get a new location, sending the list of the 
datanodes already tried without success. Could be increased, especially if the 
socket timeouts have been lowered. See HBASE-6490.

dfs.client.block.write.locateFollowingBlock.retries
Default 5
Number of retries to the namenode when the client gets a 
NotReplicatedYetException, i.e. the blocks of the file written so far are not 
yet replicated to dfs.replication.min. This should not impact HBase, as 
dfs.replication.min defaults to 1.

dfs.client.max.block.acquire.failures
Default 3
Number of tries to read a block from the datanode list. In other words, if 5 
datanodes are supposed to hold a block (i.e. dfs.replication equals 5), the 
client will try all of these datanodes, then check the value of 
dfs.client.max.block.acquire.failures to see whether it should retry. If so, 
it will get a new list (likely the same one) and will try to reconnect to all 
5 datanodes again. Could be increased, especially if the socket timeouts have 
been lowered.
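
These are HDFS-client-side settings, so they belong in the configuration HBase 
loads. As a minimal, hypothetical sketch (not a recommendation of values), the 
following prints the retry settings HBase actually sees and the combined worst 
case mentioned above (10 + 45 = 55 connection attempts against one server when 
socket errors and timeouts are mixed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Hypothetical sketch: the defaults passed to getInt() below are the ones
// documented above and are only used if the setting is absent.
public class ShowClientRetrySettings {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    int onError = conf.getInt("ipc.client.connect.max.retries", 10);
    int onTimeout = conf.getInt("ipc.client.connect.max.retries.on.timeouts", 45);
    System.out.println("retries on socket errors:   " + onError);
    System.out.println("retries on socket timeouts: " + onTimeout);
    // With HADOOP-7932, the two counters are independent:
    System.out.println("worst case when mixed:      " + (onError + onTimeout));
    System.out.println("block write retries:        "
        + conf.getInt("dfs.client.block.write.retries", 3));
    System.out.println("block acquire failures:     "
        + conf.getInt("dfs.client.max.block.acquire.failures", 3));
  }
}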
 

2.2) Timeouts
2.2.1) Heartbeats
dfs.heartbeat.interval
Default 3s

heartbeat.recheck.interval
Default 300s

A datanode is considered dead when there has been no heartbeat for (2 * 
heartbeat.recheck.interval + 10 * dfs.heartbeat.interval) seconds, i.e. 630s 
with the defaults. So for the first 10 minutes and 30 seconds, the datanode is 
still considered fully available by the namenode. After this delay, HDFS is 
likely to start replicating the blocks held by the dead node to get back to 
the right number of replicas. As a consequence, being too aggressive here has 
a side effect: it adds workload to an already damaged cluster. For this reason 
it is not recommended to change these settings.

As there are communications between the datanodes, and as they share these 
settings, these settings are both HDFS-client-side and HDFS-server-side.
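
To make the 630s figure explicit, here is the arithmetic as a tiny Java sketch 
(the values are the defaults quoted above; on a real cluster they come from 
the HDFS server-side configuration):

// Worked example of the dead-node detection delay, with the default values.
public class DeadNodeInterval {
  public static void main(String[] args) {
    long dfsHeartbeatIntervalSec = 3;    // dfs.heartbeat.interval
    long recheckIntervalSec = 300;       // heartbeat.recheck.interval
    long deadAfterSec = 2 * recheckIntervalSec + 10 * dfsHeartbeatIntervalSec;
    System.out.println("Datanode declared dead after " + deadAfterSec + "s"); // 630s, i.e. 10m30s
  }
}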

2.2.2) Socket timeouts
Three timeouts are used in HDFS:
 - connect timeout: the timeout when establishing a connection
 - read timeout: the timeout when reading on an already established connection
 - write timeout: the timeout when writing on an already established 
connection.


They are managed by two settings:

dfs.socket.timeout
Default 60s

dfs.datanode.socket.write.timeout
Default 480s

But these settings are used:
- between the DFSClient and the datanodes
- between the ipc.Client and the datanodes
- between the datanodes themselves
- sometimes, but not always, with an extension (depending on the number of 
replicas)
- for dfs.socket.timeout, both as a socket connect timeout and as a socket 
read timeout
- for dfs.datanode.socket.write.timeout, when it is set to 0, a plain old java 
socket is created in some cases instead of an NIO one.


The final calculated connect timeout can be:
 - hard-coded to 20s for the ipc.Client in Hadoop 1.0.3 (changed in 
HADOOP-7397)
 - dfs.socket.timeout (ex: DataNode#DataTransfer, DataXceiver#replaceBlock)
 - dfs.socket.timeout + 3s*#replica (ex: DataXceiver#write, 
DFSClient#getFileChecksum called from FileCheckSumServlet)

The final calculated read timeout can be:
 - dfs.socket.timeout (ex: DataXceiver#replaceBlock, ipc.Client from DFSClient)
 - dfs.socket.timeout + 3s*#replica (ex: DataNode#DataTransfer, 
DataXceiver#write)
 - dfs.socket.timeout * #replica (ex: DataNode#DataTransfer)

The final calculated write timeout can be:
 - dfs.datanode.socket.write.timeout (ex: DataXceiver#copyBlock/readBlock/...)
 - dfs.datanode.socket.write.timeout + 5s*#replica (ex: 
DFSClient#createBlockOutputStream, DataXceiver#writeBlock)
 - dfs.datanode.socket.write.timeout + 5s*(#replica - 1) (ex: 
DataNode#DataTransfer; see HADOOP-5464).

Hence we will often see a 69000 ms timeout in the logs before the datanode is 
marked dead/excluded. Also, setting dfs.socket.timeout to 0 does not make it 
wait forever: the data transfer will likely time out after 9s instead of 69s.
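
As a worked example of the formulas above (assuming the defaults and 
dfs.replication = 3; the 3s and 5s per-replica extensions are the ones listed 
above):

// Hypothetical arithmetic only; the real value depends on the code path.
public class EffectiveTimeouts {
  public static void main(String[] args) {
    int socketTimeoutMs = 60000;   // dfs.socket.timeout
    int writeTimeoutMs = 480000;   // dfs.datanode.socket.write.timeout
    int replicas = 3;              // dfs.replication
    // Connect/read with the per-replica extension: 60000 + 3 * 3000 = 69000 ms,
    // the value seen in the logs before a datanode is marked dead/excluded.
    System.out.println("connect/read: " + (socketTimeoutMs + 3000 * replicas) + " ms");
    // Write with the per-replica extension: 480000 + 3 * 5000 = 495000 ms.
    System.out.println("write:        " + (writeTimeoutMs + 5000 * replicas) + " ms");
  }
}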


3) Typical error logs
3.1) Typical logs when all the datanodes holding a block are dead, making the 
HBase recovery impossible. With a 0.90 HBase, the master logs will contain:
INFO hdfs.DFSClient: Failed to connect to /xxx:50010, add to deadNodes and 
continue java.net.SocketTimeoutException: 60000 millis timeout while waiting 
for channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending 
remote=/region-server-1:50010]
=> The client tried to connect to a dead datanode.
=> It failed, so the client will try the next datanode in the list. Usually 
the list size is 3 (dfs.replication).
=> If the final list is empty, it means that all the datanodes proposed by the 
namenode are in our dead nodes list.
=> The HDFS client clears the dead nodes list, sleeps 3 seconds (hard-coded, 
swallowing InterruptedException), and asks the namenode again. This is the log 
line:
INFO hdfs.DFSClient: Could not obtain block blk_xxx from any node: 
java.io.IOException: No live nodes contain current block. Will get new block 
locations from namenode and retry...
=> All the locations initially given by the namenode to this client are 
actually dead. The client asks for a new set of locations.
=> We are very likely to get exactly the same datanode list as 3 seconds ago, 
unless a datanode came back to life or a replication has just finished.
=> After dfs.client.max.block.acquire.failures (default: 3), an exception is 
thrown and logged, and we have in the logs:
WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: 
blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
=> There is another retry, hard-coded to 2, but it is logged only once, even 
if the second try fails.
=> Moreover, for the second try the error counters are not reinitialized, nor 
is the dead nodes list, so this second attempt is unlikely to succeed. It will 
come back again with an empty node list and throw a new 
java.io.IOException: Could not obtain block: blk_xxx 
file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
=> This exception goes up to the final client (HBase). HBase logs it, and we 
will see:
INFO wal.HLogSplitter: Got while parsing hlog 
hdfs://namenode:8020//hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy. Marking as 
corrupted java.io.IOException: Could not obtain block: blk_xxx 
file=/hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy
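
The behaviour behind these log lines can be summarized with the following 
deliberately simplified Java sketch. It is NOT the DFSClient source: every 
method name below is a hypothetical stub, and only the control flow (dead-node 
list, hard-coded 3s sleep, dfs.client.max.block.acquire.failures) mirrors what 
the logs show:

import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified illustration only; locateBlock() and readFrom() are hypothetical
// stubs standing in for the real HDFS client internals.
public abstract class BlockReadRetrySketch {
  static final int MAX_BLOCK_ACQUIRE_FAILURES = 3; // dfs.client.max.block.acquire.failures

  byte[] readBlock(String blockId) throws IOException, InterruptedException {
    int failures = 0;
    Set<String> deadNodes = new HashSet<String>();
    while (true) {
      List<String> locations = locateBlock(blockId);     // ask the namenode
      String node = firstLiveNode(locations, deadNodes);
      if (node == null) {                                // "No live nodes contain current block"
        if (++failures > MAX_BLOCK_ACQUIRE_FAILURES) {
          throw new IOException("Could not obtain block: " + blockId);
        }
        deadNodes.clear();                               // the dead nodes list is cleared
        Thread.sleep(3000);                              // hard-coded 3s, then ask the namenode again
        continue;
      }
      try {
        return readFrom(node, blockId);
      } catch (IOException e) {
        deadNodes.add(node);                             // "add to deadNodes and continue"
      }
    }
  }

  static String firstLiveNode(List<String> locations, Set<String> deadNodes) {
    for (String n : locations) {
      if (!deadNodes.contains(n)) return n;
    }
    return null;
  }

  abstract List<String> locateBlock(String blockId) throws IOException;
  abstract byte[] readFrom(String node, String blockId) throws IOException;
}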


3.2) Typical logs for write issues: the master reads the log, then wants to 
split it, hence writes a block:
INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 
java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel 
to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending remote=/xxx:50010]
=> We tried to connect to a dead datanode to write. This is likely from the 
master (it does not host a datanode, so it connects to a remote one).
=> A region server will not get this type of error, as it connects to a local 
datanode to write.
=> It failed at the very beginning: we could not connect at all (i.e. the 
failure is not during the write itself).
INFO hdfs.DFSClient: Abandoning block blk_xxx
=> HBase (as an HDFS client) told the namenode that the block was not written.
INFO hdfs.DFSClient: Excluding datanode xxx:50010
=> Internally, the HDFS client stream puts the datanode in its excludedNodes 
list (hence the "Excluding datanode" log line).
=> The HDFS client then goes back to the namenode to ask for another datanode 
set, sending the excluded datanode list to be sure it is not offered the same 
nodes again.
=> There will be 3 retries by default (setting: dfs.client.block.write.retries). 
If you have lost 20% of your cluster, the 3 attempts will all fail about 1% of 
the time (0.2^3). If that happens (i.e. all attempts failed), the next log line 
is:
WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to 
create new block. And then, if it was for a log split:
FATAL wal.HLogSplitter: WriterThread-xxx Got while writing log entry to log 
(various possible stacks here)
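
As for the read path, the control flow behind these write-path log lines can 
be sketched as follows. Again this is NOT the real DFSClient code; the method 
names are hypothetical stubs, and only the "Abandoning block" / "Excluding 
datanode" sequence and the role of dfs.client.block.write.retries are 
illustrated:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only; allocateBlock(), firstDatanodeOf(),
// createBlockOutputStream() and abandonBlock() are hypothetical stubs.
public abstract class BlockWriteRetrySketch {
  static final int BLOCK_WRITE_RETRIES = 3; // dfs.client.block.write.retries

  void writeBlock(byte[] data) throws IOException {
    List<String> excludedNodes = new ArrayList<String>();
    for (int attempt = 0; attempt < BLOCK_WRITE_RETRIES; attempt++) {
      String block = allocateBlock(excludedNodes); // namenode proposes datanodes, skipping excluded ones
      try {
        createBlockOutputStream(block, data);      // connect to the first datanode of the pipeline
        return;
      } catch (IOException e) {
        abandonBlock(block);                       // "Abandoning block blk_xxx"
        excludedNodes.add(firstDatanodeOf(block)); // "Excluding datanode xxx:50010"
      }
    }
    // The "DataStreamer Exception" seen in the logs when all attempts failed.
    throw new IOException("Unable to create new block.");
  }

  abstract String allocateBlock(List<String> excludedNodes) throws IOException;
  abstract String firstDatanodeOf(String block);
  abstract void createBlockOutputStream(String block, byte[] data) throws IOException;
  abstract void abandonBlock(String block) throws IOException;
}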

