[
https://issues.apache.org/jira/browse/HBASE-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057030#comment-14057030
]
Hadoop QA commented on HBASE-6626:
----------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12654912/HBASE-6626.patch
against trunk revision .
ATTACHMENT ID: 12654912
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:green}+0 tests included{color}. The patch appears to be a
documentation patch that doesn't require tests.
{color:red}-1 patch{color}. The patch command could not apply the patch.
Console output:
https://builds.apache.org/job/PreCommit-HBASE-Build/10009//console
This message is automatically generated.
> Add a chapter on HDFS in the troubleshooting section of the HBase reference
> guide.
> ----------------------------------------------------------------------------------
>
> Key: HBASE-6626
> URL: https://issues.apache.org/jira/browse/HBASE-6626
> Project: HBase
> Issue Type: Improvement
> Components: documentation
> Affects Versions: 0.95.2
> Reporter: Nicolas Liochon
> Assignee: Misty Stanley-Jones
> Priority: Blocker
> Attachments: HBASE-6626.patch, troubleshooting.txt
>
>
> I looked mainly at the major failure case, but here is what I have:
> New sub-chapter in the existing chapter "Troubleshooting and Debugging
> HBase": "HDFS & HBase"
> 1) HDFS & HBase
> 2) Connection related settings
> 2.1) Number of retries
> 2.2) Timeouts
> 3) Log samples
> 1) HDFS & HBase
> HBase uses HDFS to store its HFiles, i.e. the core HBase data files, and its
> Write-Ahead-Logs, i.e. the files that will be used to restore the data after
> a crash.
> In both cases, the reliability of HBase comes from the fact that HDFS writes
> the data to multiple locations. To be efficient, HBase needs the data to be
> available locally, hence it's highly recommended to have the HDFS datanode on
> the same machines as the HBase Region Servers.
> Detailed information on how HDFS works can be found at [1].
> Important features are:
> - HBase is a client application of HDFS, i.e. it uses the HDFS DFSClient class.
> This class can appear in HBase logs along with other HDFS client related logs.
> - Some HDFS settings are HDFS-server-side, i.e. must be set on the HDFS
> side, others are HDFS-client-side, i.e. must be set in HBase, and some
> must be set in both places (see the sketch at the end of this list).
> - the HDFS writes are pipelined from one datanode to another. When writing,
> there are communications between:
> - HBase and HDFS namenode, through the HDFS client classes.
> - HBase and HDFS datanodes, through the HDFS client classes.
> - HDFS datanodes between themselves: issues in these communications are in the
> HDFS logs, not in HBase. HDFS writes are always local when possible. As a
> consequence, there should not be many write errors in HBase Region Servers:
> they write to the local datanode. If this datanode can't replicate the
> blocks, it will appear in its logs, not in the region server logs.
> - datanodes can be contacted through the ipc.Client interface (once again,
> this class can show up in HBase logs) and the data transfer interface
> (usually shows up as the DataNode class in the HBase logs). These are on
> different ports (defaults being: 50010 and 50020).
> - To understand exactly what's going on, you must look at the HDFS log
> files as well: the HBase logs represent only the client side.
> - With the default setting, HDFS needs 630s to mark a datanode as dead. For
> this reason, this node will still be tried by HBase or by other datanodes
> when writing and reading until HDFS definitively decides it's dead. This will
> add some extra lines in the logs. This monitoring is performed by the
> NameNode.
> - The HDFS clients (i.e. HBase using HDFS client code) don't fully rely on
> the NameNode, but can temporarily mark a node as dead if they had an error
> when they tried to use it.
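> To check what a given HBase process actually sees for an HDFS-client-side
> setting, here is a minimal sketch (assuming hbase-site.xml and hdfs-site.xml
> are on the classpath; the defaults are the values quoted in this chapter):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public class ClientSideSettingsCheck {
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();   // loads core-site.xml from the classpath
>     conf.addResource("hdfs-site.xml");          // HDFS settings (server and client side)
>     conf.addResource("hbase-site.xml");         // client-side overrides seen by HBase
>     // HDFS-client-side: read by the DFSClient/ipc.Client code running inside HBase.
>     System.out.println("ipc.client.connect.max.retries = "
>         + conf.getInt("ipc.client.connect.max.retries", 10));
>     // Used on both sides: by the client and between the datanodes.
>     System.out.println("dfs.socket.timeout = "
>         + conf.getInt("dfs.socket.timeout", 60000) + " ms");
>   }
> }
> {code}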
> 2) Settings for retries and timeouts
> 2.1) Retries
> ipc.client.connect.max.retries
> Default 10
> Indicates the number of retries a client will make to establish a server
> connection. Not taken into account if the error is a SocketTimeout. In this
> case the number of retries is 45 (fixed on branch, HADOOP-7932 or in
> HADOOP-7397). For SASL, the number of retries is hard-coded to 15. Can be
> increased, especially if the socket timeouts have been lowered.
> ipc.client.connect.max.retries.on.timeouts
> Default 45
> If you have HADOOP-7932, the max number of retries on timeout. The counter is
> separate from ipc.client.connect.max.retries, so if you mix the socket errors
> you can get up to 55 retries with the default values. Could be lowered, once
> it is available. With HADOOP-7397, ipc.client.connect.max.retries is reused,
> so there would be 10 tries.
> dfs.client.block.write.retries
> Default 3
> Number of tries for the client when writing a block. After a failure, the
> client will go back to the namenode to get a new location, sending the list of
> the datanodes already tried without success. Could be increased, especially if
> the socket timeouts have been lowered. See HBASE-6490.
> dfs.client.block.write.locateFollowingBlock.retries
> Default 5
> Number of retries to the namenode when the client gets a
> NotReplicatedYetException, i.e. the blocks already written for the file are
> not yet replicated to dfs.replication.min. This should not impact HBase, as
> dfs.replication.min defaults to 1.
> dfs.client.max.block.acquire.failures
> Default 3
> Number of tries to read a block from the datanode list. In other words, if 5
> datanodes are supposed to hold a block (so dfs.replication equals 5), the
> client will try all these datanodes, then check the value of
> dfs.client.max.block.acquire.failures to see if it should retry or not. If
> so, it will get a new list (likely the same) and will try to reconnect again
> to all these 5 datanodes. Could be increased, especially if the socket
> timeouts have been lowered.
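> As an illustration only, the retry settings above could be set
> programmatically on the client side; in practice they would normally go in
> hbase-site.xml, since HBase is the HDFS client here. The values below are
> simply the defaults quoted above, not recommendations:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> public class RetrySettingsSketch {
>   public static Configuration clientConf() {
>     Configuration conf = new Configuration();
>     // Retries to establish a server connection (not used for SocketTimeoutException).
>     conf.setInt("ipc.client.connect.max.retries", 10);
>     // Retries on connect timeouts, if HADOOP-7932 is available (separate counter).
>     conf.setInt("ipc.client.connect.max.retries.on.timeouts", 45);
>     // Retries when writing a block; could be increased if socket timeouts are lowered.
>     conf.setInt("dfs.client.block.write.retries", 3);
>     // Retries against the namenode on NotReplicatedYetException.
>     conf.setInt("dfs.client.block.write.locateFollowingBlock.retries", 5);
>     // Number of passes over the datanode list when trying to read a block.
>     conf.setInt("dfs.client.max.block.acquire.failures", 3);
>     return conf;
>   }
> }
> {code}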
>
> 2.2) Timeouts
> 2.2.1) Heartbeats
> dfs.heartbeat.interval
> Default is 3s
> heartbeat.recheck.interval
> Default is 300s
> A datanode is considered as dead when there is no heartbeat for (2 *
> heartbeat.recheck.interval + 10 * dfs.heartbeat.interval) seconds, i.e. 630s
> with the defaults. So during these first 10 minutes 30 seconds, the datanode
> is still considered as fully available by the namenode. After this delay,
> HDFS is likely to start replicating the blocks held by the dead node to get
> back to the right number of replicas. As a consequence, being too aggressive
> here has a side effect: it adds workload to an already damaged cluster. For
> this reason it's not recommended to change these settings.
> As there are communications between the datanodes, and as they share these
> settings, these settings are both HDFS-client-side and HDFS-server-side.
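> A quick check of the arithmetic above, with the default values (a worked
> example, nothing more):
> {code:java}
> public class DeadNodeDelay {
>   public static void main(String[] args) {
>     int heartbeatIntervalSec = 3;     // dfs.heartbeat.interval
>     int recheckIntervalSec = 300;     // heartbeat.recheck.interval
>     // A datanode is declared dead after 2 * recheck + 10 * heartbeat seconds.
>     int deadNodeDelaySec = 2 * recheckIntervalSec + 10 * heartbeatIntervalSec;
>     System.out.println(deadNodeDelaySec + "s");   // 630s, i.e. 10 minutes 30 seconds
>   }
> }
> {code}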
> 2.2.2) Socket timeouts
> 3 timeouts are considered in HDFS:
> - connect timeout: the timeout when we try to establish the connection
> - read timeout: the timeout when we read something on an already established
> connection
> - write timeout: the timeout when we try to write something on an already
> established connection.
> They are managed by two settings:
> dfs.socket.timeout
> Default 60s
> dfs.datanode.socket.write.timeout
> Default is 480s.
> But these settings are used:
> - between the DFSClient and the datanode
> - between the ipc.Client and the datanodes
> - Between the datanodes
> - sometimes, but not always, with an extension (depending on the number of
> replicas)
> - for dfs.socket.timeout, both as a socket connect timeout and as a socket
> read timeout.
> - for dfs.datanode.socket.write.timeout, when it's set to 0, a plain old java
> socket is created in some cases instead of a NIO socket.
> The final calculated connect timeout can be:
> hard-coded to 20s for the ipc.Client in Hadoop 1.0.3 (changed in
> HADOOP-7397)
> dfs.socket.timeout (ex: DataNode#DataTransfer, DataXceiver#replaceBlock)
> dfs.socket.timeout + 3s*#replica (ex: DataXceiver#write,
> DFSClient#getFileChecksum called from FileCheckSumServlet)
> The final calculated read timeout can be:
> dfs.socket.timeout (DataXceiver#replaceBlock, ipc.Client from DFSClient)
> dfs.socket.timeout + 3s*#replica (ex: DataNode#DataTransfer,
> DataXceiver#write)
> dfs.socket.timeout * #replica (ex: DataNode#DataTransfer)
>
> The final calculated write timeout can be:
> dfs.datanode.socket.write.timeout (ex DataXceiver#copyBlock/readBlock/...)
> dfs.datanode.socket.write.timeout + 5s*#replica (ex:
> DFSClient#createBlockOutputStream, DataXceiver#writeBlock)
> dfs.datanode.socket.write.timeout + 5s*(#replica -1) (ex:
> DataNode#DataTransfer. See HADOOP-5464).
> Hence we will often see a 69000 ms timeout in the logs before the datanode is
> marked dead/excluded. Also, setting "dfs.socket.timeout" to 0 does not make
> it wait forever, but likely means 9 seconds instead of 69s for data transfer.
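> To see where the 69000 ms comes from, a worked example with the default
> values and 3 replicas (the per-replica extensions are the 3s/5s quoted above):
> {code:java}
> public class TimeoutArithmetic {
>   public static void main(String[] args) {
>     int socketTimeoutMs = 60000;     // dfs.socket.timeout
>     int writeTimeoutMs = 480000;     // dfs.datanode.socket.write.timeout
>     int replicas = 3;                // dfs.replication
>     // Extended connect/read timeout: dfs.socket.timeout + 3s per replica.
>     int extendedReadMs = socketTimeoutMs + 3000 * replicas;    // 69000 ms
>     // Extended write timeout: dfs.datanode.socket.write.timeout + 5s per replica.
>     int extendedWriteMs = writeTimeoutMs + 5000 * replicas;    // 495000 ms
>     System.out.println(extendedReadMs + " ms / " + extendedWriteMs + " ms");
>   }
> }
> {code}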
> 3) Typical error logs.
> 3.1) Typical logs when all datanodes for a block are dead, making the HBase
> recovery impossible. With a 0.90 HBase, the master logs will contain:
> INFO hdfs.DFSClient: Failed to connect to /xxx50010, add to deadNodes and
> continue java.net.SocketTimeoutException: 60000 millis timeout while waiting
> for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=/region-server-1:50010]
> => The client tries to connect to a dead datanode.
> => It failed, so the client will try the next datanode in the list. Usually
> the list size is 3 (dfs.replication).
> => If the final list is empty, it means that all the datanodes proposed by
> the namenode are in our dead nodes list.
> => The HDFS client clears the dead nodes list, sleeps 3 seconds
> (hard-coded), swallowing InterruptedException, and asks the namenode again.
> This is the log line:
> INFO hdfs.DFSClient: Could not obtain block blk_xxx from any node:
> java.io.IOException: No live nodes contain current block. Will get new block
> locations from namenode and retry...
> => All the locations initially given by the namenode to this client are
> actually dead. The client asks for a new set of locations.
> => We're very likely to get exactly the same datanode list as 3 seconds ago,
> unless a datanode came back to life or a replication has just finished.
> => After dfs.client.max.block.acquire.failures failures (default: 3), an
> exception is thrown and logged, and we have in the logs:
> WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block:
> blk_xxx file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
> => There is another retry, hard-coded to 2, but this is logged only once,
> even if the second try fails.
> => Moreover, for the second try the error counters are not reinitialized,
> including the dead nodes list, so this second attempt is unlikely to succeed.
> It will likely end up again with an empty node list, and throw a new
> java.io.IOException: Could not obtain block: blk_xxx
> file=/hbase/.logs/boxxxx,60020,xxx/xxx%3A60020.yyy
> => This exception will go to the final client (HBase). HBase will log it, and
> we will see
> INFO wal.HLogSplitter: Got while parsing hlog
> hdfs://namenode:8020//hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy. Marking as
> corrupted java.io.IOException: Could not obtain block: blk_xxx
> file=/hbase/.logs/boxxxx,60020,xxx/xxx60020.yyy
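> The read path described above boils down to the following loop. This is a
> simplified sketch of that behaviour, not the actual DFSClient code; the
> Namenode/Datanode interfaces are hypothetical stand-ins:
> {code:java}
> import java.io.IOException;
> import java.util.HashSet;
> import java.util.List;
> import java.util.Set;
>
> public class BlockReadRetrySketch {
>   interface Namenode { List<String> locateBlock(String blockId); }   // hypothetical stand-in
>   interface Datanode { byte[] read(String node, String blockId) throws IOException; }  // hypothetical
>
>   static byte[] readBlock(Namenode nn, Datanode dn, String blockId, int maxFailures)
>       throws IOException, InterruptedException {
>     Set<String> deadNodes = new HashSet<String>();
>     // maxFailures plays the role of dfs.client.max.block.acquire.failures.
>     for (int failures = 0; failures < maxFailures; failures++) {
>       for (String node : nn.locateBlock(blockId)) {       // ask the namenode for locations
>         if (deadNodes.contains(node)) {
>           continue;
>         }
>         try {
>           return dn.read(node, blockId);
>         } catch (IOException e) {
>           deadNodes.add(node);      // "Failed to connect ..., add to deadNodes and continue"
>         }
>       }
>       deadNodes.clear();            // every proposed node failed: clear the dead nodes list,
>       Thread.sleep(3000);           // sleep the hard-coded 3s, and ask the namenode again
>     }
>     throw new IOException("Could not obtain block: " + blockId);
>   }
> }
> {code}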
> 3.2) Typical log for write issues: the master reads the log, then wants to
> split it, hence writing a block:
> INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=/ xxx:50010]
> => We tried to connect to the dead datanode to write. Likely from the master
> (it does not have a datanode, so it connects to a remote datanode).
> => A region server will not have this type of error, as it connects to a
> local datanode to write.
> => It failed right at the beginning: we could not connect at all (i.e. the
> failure was not during the write itself).
> INFO hdfs.DFSClient: Abandoning block blk_xxx
> => HBase (as an HDFS client) told the namenode that the block was not
> written.
> INFO hdfs.DFSClient: Excluding datanode xxx:50010
> => Internally, the HDFS client stream puts it in the excludedNodes list
> (hence the "Excluding datanode" log line).
> => The HDFS client goes back to the namenode asking for another datanode
> set proposal, sending the excluded datanode list to be sure it's not trying
> the same nodes again.
> => There will be 3 retries by default (setting:
> "dfs.client.block.write.retries"). If you've lost 20% of your cluster, then
> roughly 1% of the time (0.2^3) all 3 attempts will fail. If that's the case
> (i.e. all attempts failed), the next log line is:
> WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to
> create new block.
> And then, if it was for a split log:
> FATAL wal.HLogSplitter: WriterThread-xxx Got while writing log entry to log
> (various possible stacks here)
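> Similarly, the write path described in 3.2 can be summarised by the sketch
> below: a simplified illustration of the abandon/exclude/retry behaviour, not
> the real DFSClient write pipeline; the interfaces are hypothetical stand-ins:
> {code:java}
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> public class BlockWriteRetrySketch {
>   interface Namenode {                                                 // hypothetical stand-in
>     List<String> allocateBlock(String file, List<String> excludedNodes);
>     void abandonBlock(String file);
>   }
>   interface Pipeline { void write(List<String> nodes) throws IOException; }  // hypothetical
>
>   static void writeBlock(Namenode nn, Pipeline pipeline, String file, int maxRetries)
>       throws IOException {
>     List<String> excludedNodes = new ArrayList<String>();
>     // maxRetries plays the role of dfs.client.block.write.retries.
>     for (int attempt = 0; attempt < maxRetries; attempt++) {
>       List<String> nodes = nn.allocateBlock(file, excludedNodes);  // new datanode set proposal
>       try {
>         pipeline.write(nodes);           // createBlockOutputStream + write to the pipeline
>         return;
>       } catch (IOException e) {
>         nn.abandonBlock(file);           // "Abandoning block blk_xxx"
>         excludedNodes.add(nodes.get(0)); // "Excluding datanode xxx:50010" (the unreachable node,
>       }                                  // first in the pipeline in this simplified sketch)
>     }
>     throw new IOException("Unable to create new block.");
>   }
> }
> {code}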
--
This message was sent by Atlassian JIRA
(v6.2#6252)