[ https://issues.apache.org/jira/browse/HDFS-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442588#comment-13442588 ]

nkeywal commented on HDFS-3705:
-------------------------------

Hello Suresh,

For sure, HDFS-3703 is absolutely key for HBase. It's great you're doing this. 
BTW, don't hesitate to ask me if you want me to test it with HBase. 

For HDFS-3705, it's a difficult question. I think there are two questions:
1) Is it superseded by HDFS-3703?
2) Can it be replaced by a server-side-only implementation?

For 1)
If I take HBase & HDFS as they are today, it's more or less a yes: most of the 
time, people configure HBase with a timeout of 45s. So if HDFS detects the 
failure within 30s, it is already in the right state when HBase starts its 
recovery, and HDFS-3705 is useless. 
However, even today, I've seen people claiming to run with a 10-second 
timeout.
Looking further, even if a configuration in which HDFS is more aggressive than 
HBase will always be simpler, I don't think we can make this a systematic 
precondition:
- The HBase timeouts are driven largely by GC issues. These are progressively 
being resolved, for example by the new GC settings in JDK 1.7. If that works, 
the HBase timeout will be decreased.
- As HBase uses a different failure-detection mechanism than HDFS, there will 
always be mismatches. If ZooKeeper improves its detection mechanism, there will 
be a period of time during which ZooKeeper detects failures faster than HDFS. 
Along the same line, there are the differences between connect and read 
timeouts: some issues are detected sooner than others. 
- If we want HDFS & HBase to become more and more realtime, settings will 
become more and more aggressive, and in the end the difference between HBase & 
HDFS will be a few seconds, i.e. something you can't really rely on when there 
are failures on the cluster.

For 2), when we discussed HDFS-3702 on the HBase list, doing this namenode-side 
with HDFS-385 was rejected because the namenode could be shared between 
different teams / applications, and the operations team could refuse to deploy 
a namenode configuration specific to HBase. I guess we're facing a similar 
issue here.


It's not simple; but given the above points, I think that even if HDFS-3703 
does 90% of the work, the remaining 10% needs deep cooperation between HBase & 
HDFS. Having the API LimitedPrivate is not an issue for HBase imho, and it buys 
some time to validate the API.

I'm happy to get other opinions here :-)

                
> Add the possibility to mark a node as 'low priority' for read in the DFSClient
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-3705
>                 URL: https://issues.apache.org/jira/browse/HDFS-3705
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs client
>    Affects Versions: 1.0.3, 2.0.0-alpha, 3.0.0
>            Reporter: nkeywal
>             Fix For: 3.0.0
>
>         Attachments: hdfs-3705.sample.patch, HDFS-3705.v1.patch
>
>
> This has been partly discussed in HBASE-6435.
> The DFSClient includes 'bad nodes' management for reads and writes. 
> Sometimes, the client application already knows that some datanodes are dead 
> or likely to be dead.
> An example is the HBase Write-Ahead-Log: when HBase reads this file, it 
> knows that the HBase regionserver died, and it's very likely that the box 
> died, so the datanode on the same box is dead as well. This is actually 
> critical, because:
> - it's the hbase recovery that reads these log files
> - if we read them it means that we lost a box, so we have 1 dead replica out 
> of the 3. 
> - for all files read, we have a 33% chance of going to the dead datanode
> - as the box just died, we're very likely to get a timeout exception, so 
> we're delaying the hbase recovery by 1 minute. For HBase, it means that the 
> data is not available during this minute.
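To make the intent of the description concrete, here is a minimal client-side sketch of the idea: the application hints that a datanode is probably dead, and the replica list is reordered so that node is tried last instead of first. The class and method names (`LowPriorityNodes`, `markLowPriority`, `sortByPriority`) are illustrative only, not the actual DFSClient API or the attached patch.

```java
import java.util.*;

/**
 * Hypothetical sketch of the 'low priority node' idea from HDFS-3705.
 * The client records datanodes it believes are dead (e.g. the datanode
 * co-located with a crashed HBase regionserver) and demotes them to the
 * end of the replica list, so a read tries the healthy replicas first
 * instead of waiting out a connect timeout on the dead one.
 */
public class LowPriorityNodes {
    private final Set<String> lowPriority = new HashSet<>();

    /** Hint from the application: this datanode is probably dead. */
    public void markLowPriority(String datanode) {
        lowPriority.add(datanode);
    }

    /**
     * Reorder replica locations so low-priority nodes come last.
     * The relative order of the remaining nodes is preserved.
     */
    public List<String> sortByPriority(List<String> replicas) {
        List<String> preferred = new ArrayList<>();
        List<String> demoted = new ArrayList<>();
        for (String dn : replicas) {
            (lowPriority.contains(dn) ? demoted : preferred).add(dn);
        }
        preferred.addAll(demoted);
        return preferred;
    }
}
```

In the WAL-recovery scenario above, HBase would mark the regionserver's co-located datanode before opening the log files, so only 1 replica out of 3 is demoted and the other two are tried first.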

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira