[
https://issues.apache.org/jira/browse/HDFS-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442588#comment-13442588
]
nkeywal commented on HDFS-3705:
-------------------------------
Hello Suresh,
For sure, HDFS-3703 is absolutely key for HBase. It's great you're doing this.
BTW, don't hesitate to ask me if you want me to test it with HBase.
For HDFS-3705, it's a difficult question. There are two questions, I think:
1) Is it superseded by HDFS-3703?
2) Can it be replaced by a server-side-only implementation?
For 1)
If I take HBase & HDFS as they are today, it's more or less a yes: most of the
time, people configure HBase with a timeout of 45s. So if HDFS does 30s, it has
the right state when HBase starts the recovery. So it's done, and HDFS-3705 is
useless.
However, even today, I've seen people claiming a configuration with a 10-second
timeout.
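Just to make the ordering concrete, here's a rough Java sketch of the two knobs
involved. The HDFS key names are my assumption of how HDFS-3703 will land
(dfs.namenode.stale.datanode.interval etc.), so they may differ in the final patch:

    import org.apache.hadoop.conf.Configuration;

    public class TimeoutOrdering {
      public static void main(String[] args) {
        // HBase side: failure detection is driven by the ZooKeeper session timeout
        // (hbase-site.xml). 45s is the value I see most often today.
        Configuration hbaseConf = new Configuration();
        hbaseConf.setInt("zookeeper.session.timeout", 45000);

        // HDFS side (namenode): stale-datanode detection along the lines of
        // HDFS-3703. Key names are my assumption and may differ once it's committed.
        Configuration hdfsConf = new Configuration();
        hdfsConf.setLong("dfs.namenode.stale.datanode.interval", 30000);
        hdfsConf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);

        // The approach only works if HDFS marks the node stale *before* HBase
        // starts its recovery, i.e. 30s < 45s. With a 10s HBase timeout it breaks.
        long hbaseTimeout = hbaseConf.getInt("zookeeper.session.timeout", 45000);
        long hdfsStale = hdfsConf.getLong("dfs.namenode.stale.datanode.interval", 30000);
        System.out.println("HDFS detects first: " + (hdfsStale < hbaseTimeout));
      }
    }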
Looking further, even if a configuration where HDFS is more aggressive than
HBase will always be simpler, I don't think we can make this a systematic
precondition:
- The HBase timeouts are driven largely by GC issues. This is getting resolved
more and more, for example with the new GC settings in JDK 1.7. If that works
out, the HBase timeout will be decreased.
- As HBase uses a different detection mechanism than HDFS, we will always have
a mismatch. If ZooKeeper improves its detection mechanism, there will be a
period of time when detection through ZooKeeper is faster than detection in
HDFS. Along the same lines, there are the differences between connect/read
timeouts: some issues are detected sooner than others.
- If we want HDFS & HBase to be more and more realtime, settings will be more
and more aggressive, and in the end the difference between HBase & HDFS will be
a few seconds, i.e. something you can't really rely on when there are failures
on the cluster.
For 2): when we discussed HDFS-3702 on the HBase list, doing this on the
namenode side with HDFS-385 was rejected because the namenode can be shared
between different teams / applications, and the operations team could refuse to
deploy a namenode configuration specific to HBase. I guess we're facing a
similar issue here.
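To illustrate why the client side matters here: a namenode setting like the
stale interval has to be deployed by the operations team on a shared namenode,
while anything read by the DFSClient can be shipped in the HBase process's own
configuration. A rough sketch (the client-side key name is my assumption:
dfs.client.socket-timeout on 2.x, dfs.socket.timeout on 1.x):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ClientSideOverride {
      public static void main(String[] args) throws Exception {
        // This configuration lives in the HBase process (hbase-site.xml or code),
        // so no namenode redeployment and no ops-team involvement is needed.
        Configuration hbaseSideConf = new Configuration();
        // Client-side read/connect timeout, in milliseconds. Key name assumed:
        // "dfs.client.socket-timeout" on 2.x, "dfs.socket.timeout" on 1.x.
        hbaseSideConf.setInt("dfs.client.socket-timeout", 10000);

        // Any DFSClient created from this configuration (e.g. when HBase opens
        // the WAL files during recovery) picks up the client-side value.
        FileSystem fs = FileSystem.get(hbaseSideConf);
        System.out.println("FileSystem impl: " + fs.getClass().getName());
      }
    }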
It's not simple; but given the above points, I think that even if HDFS-3703
does 90% of the work, the remaining 10% needs deep cooperation between HBase &
HDFS. Making the API LimitedPrivate is not an issue for HBase imho, and it buys
some time to validate the API.
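To be concrete about what I have in mind with LimitedPrivate: here is a purely
hypothetical sketch of such a hint API. The interface name, method names and
signatures are mine, not the ones in HDFS-3705.v1.patch; only the annotations
are the standard Hadoop ones:

    import java.util.Collection;
    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;

    /**
     * Hypothetical sketch only -- names and shape are illustrative,
     * not the actual HDFS-3705 patch.
     */
    @InterfaceAudience.LimitedPrivate({"HBase"})
    @InterfaceStability.Unstable
    public interface LowPriorityNodeHints {

      /**
       * Tell the DFS client to treat these datanodes (host:port) as low priority
       * for reads: other replicas are tried first, so reading the WAL of a dead
       * regionserver does not wait for a connect/read timeout on the dead box.
       */
      void addLowPriorityNodes(Collection<String> datanodes);

      /** Clear a hint, e.g. once the node is known to be alive again. */
      void removeLowPriorityNode(String datanode);
    }

Being LimitedPrivate, we can still change or drop it without breaking a public
contract once we see how HDFS-3703 behaves in practice.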
I'm happy to get other opinions here :-)
> Add the possibility to mark a node as 'low priority' for read in the DFSClient
> ------------------------------------------------------------------------------
>
> Key: HDFS-3705
> URL: https://issues.apache.org/jira/browse/HDFS-3705
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs client
> Affects Versions: 1.0.3, 2.0.0-alpha, 3.0.0
> Reporter: nkeywal
> Fix For: 3.0.0
>
> Attachments: hdfs-3705.sample.patch, HDFS-3705.v1.patch
>
>
> This has been partly discussed in HBASE-6435.
> The DFSClient includes a 'bad nodes' management for reads and writes.
> Sometimes, the client application already knows that some datanodes are dead
> or likely to be dead.
> An example is the 'HBase Write-Ahead-Log': when HBase reads this file, it
> knows that the HBase regionserver died, and it's very likely that the box
> died, so the datanode on the same box is dead as well. This is actually
> critical, because:
> - it's the HBase recovery that reads these log files
> - if we read them it means that we lost a box, so we have 1 dead replica out
> of the 3.
> - for every file read, we have a 33% chance of going to the dead datanode
> - as the box just died, we're very likely to get a timeout exception, so we
> delay the HBase recovery by 1 minute. For HBase, it means that the data is
> not available during this minute.