[
https://issues.apache.org/jira/browse/HDFS-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452160#comment-13452160
]
Suresh Srinivas commented on HDFS-3703:
---------------------------------------
Nicolas, let's open a separate jira for the issue you mentioned related to
DFSInputStream#readBlockLength instead of addressing it in this jira.
Regarding the patch in this jira, here are my thoughts:
# For the read side the patch is straightforward. We have the list of datanodes
where the block is. We re-order it based on liveness.
# However, for the write side, not picking the stale node could cause problems,
especially in small clusters. That is why I think we should do the write-side
changes in a related jira. We should consider making the stale timeout adaptive
to the number of nodes marked stale in the cluster, as discussed in the
previous comments. Additionally, we should consider having a separate
configuration for skipping stale nodes on writes.
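The read-side reordering described in point 1 can be sketched roughly as
follows. This is a hedged illustration, not the patch itself: {{Node}},
{{lastUpdateMillis}}, and {{sortByLiveness}} are stand-ins for DatanodeInfo and
the comparator the patch adds.

```java
import java.util.Arrays;
import java.util.Comparator;

public class StaleNodeSort {

    // Simplified stand-in for DatanodeInfo; the field names are assumptions.
    static class Node {
        final String name;
        final long lastUpdateMillis; // time of the last heartbeat

        Node(String name, long lastUpdateMillis) {
            this.name = name;
            this.lastUpdateMillis = lastUpdateMillis;
        }

        // A node is stale when its last heartbeat is older than the
        // configured stale interval.
        boolean isStale(long nowMillis, long staleIntervalMillis) {
            return nowMillis - lastUpdateMillis > staleIntervalMillis;
        }
    }

    // Stable sort: fresh nodes first, stale nodes last. Stale nodes stay in
    // the list so a client can still fall back to them when no fresh replica
    // is available (unlike decommissioned nodes, which are excluded).
    static void sortByLiveness(Node[] nodes, long now, long staleInterval) {
        Arrays.sort(nodes, Comparator.comparing(
                (Node n) -> n.isStale(now, staleInterval)));
    }

    public static void main(String[] args) {
        long now = 100_000L;
        Node[] located = {
            new Node("dn1", now - 60_000), // stale: no heartbeat for 60s
            new Node("dn2", now - 1_000),  // fresh
            new Node("dn3", now - 2_000),  // fresh
        };
        sortByLiveness(located, now, 30_000L); // 30s stale interval
        for (Node n : located) {
            System.out.println(n.name); // prints dn2, dn3, dn1
        }
    }
}
```

Because the sort is stable, the relative order of the fresh nodes (e.g. the
locality-based ordering the namenode already produces) is preserved.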
Some early comments on the patch:
# Typo: compartor should be comparator.
# Add annotation @InterfaceAudience.Private to DecomStaleComparator class
# The default stale period could be a bit longer, say 30s. Again, I know this
is arbitrary, but I still prefer a longer timeout.
# Rename BlockPlacementPolicyDefault#skipStaleNodes to checkForStaleNodes.
Currently the variable name means the opposite of what it is used for.
# Can you add a description of what stale means in the javadoc for
DatanodeInfo#isStale()? Add a pointer to the configuration that decides the
stale period.
# DFS_DATANODE_STALE_STATE_ENABLE_KEY should be named
DFS_NAMENODE_CHECK_STALE_DATANODE_KEY. (DFS_NAMENODE prefix means it is used by
the namenode). Change the value to {{dfs.namenode....}}
# DFS_DATANODE_STALE_STATE_INTERVAL_KEY should be named
DFS_NAMENODE_STALE_DATANODE_INTERVAL_KEY. Change the value to {{dfs.namenode...}}
# "node is staled" should be "node is stale". In the same debug message, it is
a good idea to print the time since the last update. This should help with
debugging.
# Why reset to the default value if the configured value is smaller? We should
just print a warning and continue.
# Why add public method DatanodeManager#setCheckStaleDatanodes()?
# Instead of making setHeartbeatsDisabledForTests public, you could provide
access to that method through {{DatanodeTestUtils}}
# Please add descriptions for the newly added properties in hdfs-default.xml
and how they are used.
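To illustrate the javadoc, default-value, and warn-and-continue points above,
here is a hypothetical sketch of what the documented isStale() check and the
configuration sanity check could look like. The class name, field names, and
defaults are assumptions, not the committed code.

```java
public class DatanodeStaleCheck {

    // Suggested 30s default; the actual value in the patch may differ.
    public static final long DEFAULT_STALE_INTERVAL_MS = 30_000L;

    private final long lastUpdateMillis; // time of the last heartbeat

    public DatanodeStaleCheck(long lastUpdateMillis) {
        this.lastUpdateMillis = lastUpdateMillis;
    }

    /**
     * A datanode is considered stale when the namenode has not received a
     * heartbeat from it for longer than the configured stale interval (the
     * dfs.namenode stale-datanode interval property). Stale nodes are
     * deprioritized for reads but are not yet marked dead.
     */
    public boolean isStale(long nowMillis, long staleIntervalMillis) {
        return nowMillis - lastUpdateMillis > staleIntervalMillis;
    }

    // Warn-and-continue: if the configured interval is shorter than the
    // default, log a warning but honor the configured value rather than
    // silently resetting it.
    public static long checkInterval(long configuredMillis) {
        if (configuredMillis < DEFAULT_STALE_INTERVAL_MS) {
            System.err.println("WARN: stale interval " + configuredMillis
                    + "ms is shorter than the default "
                    + DEFAULT_STALE_INTERVAL_MS + "ms");
        }
        return configuredMillis;
    }

    public static void main(String[] args) {
        DatanodeStaleCheck dn = new DatanodeStaleCheck(0L);
        System.out.println(dn.isStale(60_000L, checkInterval(30_000L)));
    }
}
```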
I have not reviewed the tests yet.
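For the hdfs-default.xml comment, a hypothetical sketch of the entries, using
property names derived from the renamed keys suggested above; the exact names,
defaults, and wording in the final patch may differ:

```xml
<!-- Hypothetical entries; names and defaults are assumptions. -->
<property>
  <name>dfs.namenode.check.stale.datanode</name>
  <value>false</value>
  <description>Whether the namenode should deprioritize datanodes whose
  heartbeat is older than the stale interval when ordering block locations
  for reads.</description>
</property>

<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value>
  <description>Interval in milliseconds after which a datanode that has not
  sent a heartbeat is considered stale.</description>
</property>
```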
> Decrease the datanode failure detection time
> --------------------------------------------
>
> Key: HDFS-3703
> URL: https://issues.apache.org/jira/browse/HDFS-3703
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: data-node, name-node
> Affects Versions: 1.0.3, 2.0.0-alpha
> Reporter: nkeywal
> Assignee: Suresh Srinivas
> Attachments: HDFS-3703-branch2.patch, HDFS-3703.patch,
> HDFS-3703-trunk-with-write.patch
>
>
> By default, if a box dies, the datanode will be marked as dead by the
> namenode after 10:30 minutes. In the meantime, this datanode will still be
> proposed by the namenode to write blocks or to read replicas. It happens as
> well if the datanode crashes: there are no shutdown hooks to tell the
> namenode we're not there anymore.
> It is especially an issue with HBase. The HBase regionserver timeout for
> production is often 30s. So with these configs, when a box dies HBase starts
> to recover after 30s while, for 10 minutes, the namenode will still consider
> the blocks on the same box as available. Beyond the write errors, this will
> trigger a lot of missed reads:
> - during the recovery, HBase needs to read the blocks used on the dead box
> (the ones in the 'HBase Write-Ahead-Log')
> - after the recovery, reading these data blocks (the 'HBase region') will
> fail 33% of the time with the default number of replicas, slowing down data
> access, especially when the errors are socket timeouts (i.e. around 60s most
> of the time).
> Globally, it would be ideal if the HDFS failure-detection timeouts could be
> set below the HBase ones.
> As a side note, HBase relies on ZooKeeper to detect regionservers issues.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira