[ https://issues.apache.org/jira/browse/HDFS-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420851#comment-13420851 ]

nkeywal commented on HDFS-3703:
-------------------------------

bq. Can you describe this better?
If we see this in layers, we've got three layers:
1) Hardware
2) HDFS
3) HBase

Here, layer 3 knows (or guesses) that layer 1 is dead, while the layer in the 
middle does not know it. That's not a perfect example of encapsulation :-). 
HBase is saying to HDFS: 'you know, I want some blocks, but maybe this datanode 
is not good, I'm not sure, but please don't use it'. Kind of strange (but 
useful short term).

Today, when there is a global issue, HBase starts its recovery while HDFS is 
still ignoring the problem. This leads to a nightmare of socket exceptions all 
over the place, as HBase is directed to dead nodes again and again. HDFS should 
know what's going on before HBase does. So if HBase is set with a timeout of 
30s, HDFS should have 20s or something like that.
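
To make the numbers concrete, here is a configuration sketch (the values are 
mine, purely illustrative; the property names are the 2.x ones). The namenode 
declares a datanode dead after 2 * recheck-interval + 10 * heartbeat-interval:

{code:xml}
<!-- hdfs-site.xml (illustrative values only):
     dead-node detection ~= 2 * recheck-interval + 10 * heartbeat-interval
                          = 2 * 5s + 10 * 1s = 20s
     (defaults: 2 * 300s + 10 * 3s = 10:30) -->
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>5000</value>  <!-- milliseconds -->
</property>
<property>
  <name>dfs.heartbeat.interval</name>
  <value>1</value>  <!-- seconds -->
</property>

<!-- hbase-site.xml: HBase's own detection stays above the HDFS one -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value>  <!-- milliseconds -->
</property>
{code}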

bq. Whether ZooKeeper or Datanode heartbeat to Namenode, at a high level 
mechanisms are similar. 

Fully agreed. Just that if the issue comes from ZK or the ZK links, HBase and 
HDFS would have a similar view of the situation (maybe a wrong view, but the 
same view). On the other hand, there are possible improvements, not available 
in ZK, but hopefully available one day, when there will be more code to share 
(I'm thinking about ZOOKEEPER-702). Also, still long term, ZK creates one TCP 
connection per monitored process. If multiple Hadoop processes share the same 
technology, it will make sense to have a shared component on each computer to 
lower the number of connections. I'm not aware of anything on this subject in 
ZK, so that's science fiction today. I've got other stuff like this in mind, 
but you get the idea :-).
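
For context, the ZK-side mechanism HBase relies on is just an ephemeral znode 
per monitored process; a minimal sketch (paths and names are mine, not HBase's):

{code:java}
import org.apache.zookeeper.*;

// Minimal sketch of ZooKeeper-based liveness, as HBase uses it for
// regionservers: each process holds a session and an ephemeral znode;
// when the session times out, the znode disappears and watchers are notified.
public class EphemeralLiveness {
  public static void main(String[] args) throws Exception {
    int sessionTimeoutMs = 30000; // comparable to zookeeper.session.timeout
    ZooKeeper zk = new ZooKeeper("localhost:2181", sessionTimeoutMs, event -> {});
    // The ephemeral node lives only as long as this process's ZK session.
    // (Assumes the parent /live-nodes znode already exists.)
    zk.create("/live-nodes/rs1", new byte[0],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    // A watcher on /live-nodes learns about the failure within one session
    // timeout, instead of waiting for the 10-minute heartbeat expiry.
    Thread.sleep(Long.MAX_VALUE); // keep the session (and the znode) alive
  }
}
{code}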

So, I fully agree with your main point: today, the real issue is choosing the 
right timeout.

bq. The problem is one of choosing the right timeout. Currently this is 
configurable in HDFS and 10 minutes is chosen as the timeout. I suggest 
running some experiments with setting this to a more aggressive value. I agree 
that this is a very conservative time. But false positives here could result 
in a replication storm.

Agreed; even with the current setting, people have had issues in the past. 10 
minutes seems to be a reasonable, real-world-validated timeout for 
re-replicating, and I don't think it's a good idea to make it lower. However, 
I think it would be good to have a middle state between fully available and 
definitively dead: the non-responding nodes could be removed from the target 
list for new blocks and de-prioritized for reads.
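
A minimal sketch of what I mean (names and thresholds are mine, not existing 
HDFS code):

{code:java}
import java.util.*;

// Illustrative three-state view of a datanode, instead of the current
// binary live/dead decision. Thresholds are made up for the example.
enum NodeState { LIVE, STALE, DEAD }

class NodeView {
  static final long STALE_AFTER_MS = 30_000;  // aggressive, HBase-scale
  static final long DEAD_AFTER_MS = 630_000;  // conservative, current 10:30

  static NodeState state(long lastHeartbeatMs, long nowMs) {
    long silence = nowMs - lastHeartbeatMs;
    if (silence > DEAD_AFTER_MS) return NodeState.DEAD;    // re-replicate
    if (silence > STALE_AFTER_MS) return NodeState.STALE;  // avoid, don't re-replicate
    return NodeState.LIVE;
  }

  // STALE nodes are skipped as targets for new blocks and sorted last for
  // reads; only DEAD triggers re-replication, so a false positive cannot
  // cause a replication storm.
  static List<String> orderForRead(List<String> replicas, Map<String, NodeState> states) {
    List<String> ordered = new ArrayList<>(replicas);
    ordered.sort(Comparator.comparing(
        (String r) -> states.getOrDefault(r, NodeState.LIVE) == NodeState.STALE));
    return ordered;
  }
}
{code}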

                
> Decrease the datanode failure detection time
> --------------------------------------------
>
>                 Key: HDFS-3703
>                 URL: https://issues.apache.org/jira/browse/HDFS-3703
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, name-node
>    Affects Versions: 1.0.3, 2.0.0-alpha
>            Reporter: nkeywal
>
> By default, if a box dies, the datanode will be marked as dead by the 
> namenode after 10:30 minutes. In the meantime, this datanode will still be 
> proposed by the namenode to write blocks or to read replicas. It happens as 
> well if the datanode crashes: there are no shutdown hooks to tell the 
> namenode we're not there anymore.
> It's especially an issue with HBase. The HBase regionserver timeout for 
> production is often 30s. So with these configs, when a box dies HBase starts 
> to recover after 30s while, for 10 more minutes, the namenode will consider 
> the blocks on the same box as available. Beyond the write errors, this will 
> trigger a lot of missed reads:
> - during the recovery, HBase needs to read the blocks used on the dead box 
> (the ones in the 'HBase Write-Ahead-Log')
> - after the recovery, reading these data blocks (the 'HBase region') will 
> fail 33% of the time with the default number of replicas, slowing the data 
> access, especially when the errors are socket timeouts (i.e. around 60s most 
> of the time). 
> Globally, it would be ideal if the HDFS timeouts could be set below the 
> HBase ones. 
> As a side note, HBase relies on ZooKeeper to detect regionserver issues.
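
For reference, if I read the namenode code correctly, the 10:30 figure above 
comes from the heartbeat-expiry formula applied to the default configuration 
(recheck interval 5 minutes, heartbeat interval 3 seconds):

{code}
expiry = 2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval
       = 2 * 300s                       + 10 * 3s
       = 630s = 10 minutes 30 seconds
{code}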
