[ https://issues.apache.org/jira/browse/HBASE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746312#action_12746312 ]
Jonathan Gray commented on HBASE-1084:
--------------------------------------
I'm not sure, need to test. But in any case, is there no way for us to be
smarter? If the NN keeps telling us about a dead DN and it continues to not
work, can we not ask for / find a different one? I'm just a little concerned
because we're wide open to data loss even though HDFS is up.
As far as the lease lengths... yes, we need to have recommendations for them. We
could probably safely drop them back down towards where our 0.19 lease lengths
were, 30-60 seconds I guess. One warning though: if you do something like delete
25% of the total size of HDFS at once, it can cause some rather long starvation
on the DNs, though since 0.20 I haven't seen it go much past 10-12 seconds.
You'll also need to be careful on dual-core configurations under high load, but
the same is true of the hbase timeouts, so it should be okay.
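For illustration, a recommendation could end up being as simple as lowering one
configuration value. A minimal sketch, assuming the lease is set in milliseconds
through the normal Configuration; the property key below is a stand-in, since
the exact lease being discussed isn't named here:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LeaseTuning {
  public static Configuration tunedConf() {
    Configuration conf = new HBaseConfiguration();
    // Drop the lease period towards the 0.19-era 30-60 second range.
    // "hbase.regionserver.lease.period" is a stand-in key; substitute
    // whichever lease/timeout property is actually being tuned.
    conf.setInt("hbase.regionserver.lease.period", 60 * 1000);
    return conf;
  }
}
{code}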
> Reinitializable DFS client
> --------------------------
>
> Key: HBASE-1084
> URL: https://issues.apache.org/jira/browse/HBASE-1084
> Project: Hadoop HBase
> Issue Type: Improvement
> Components: io, master, regionserver
> Reporter: Andrew Purtell
>
> HBase is the only long-lived DFS client. Tasks handle DFS errors by dying;
> HBase daemons do not, and instead depend on the dfsclient's error recovery
> capability, but that is not sufficiently developed or tested. Several issues
> result:
> * HBASE-846: hbase looses its mind when hdfs fills
> * HBASE-879: When dfs restarts or moves blocks around, hbase regionservers
> don't notice
> * HBASE-932: Regionserver restart
> * HBASE-1078: "java.io.IOException: Could not obtain block": allthough file
> is there and accessible through the dfs client
> * hlog indefinitely hung on getting new blocks from dfs on apurtell cluster
> * regions closed due to transient DFS problems during loaded cluster restart
> These issues might also be related:
> * HBASE-15: Could not complete hdfs write out to flush file forcing
> regionserver restart
> * HBASE-667: Hung regionserver; hung on hdfs: writeChunk,
> DFSClient.java:2126, DataStreamer socketWrite
> HBase should reinitialize the fs a few times upon catching fs exceptions,
> with backoff, to compensate. This can be done by making a wrapper around all
> fs operations that releases references to the old fs instance and makes and
> initializes a new instance to retry. All fs users would need to be fixed up
> to handle loss of state around fs wrapper invocations: hlog, memcache
> flusher, hstore, etc.
> Cases of clear unrecoverable failure (are there any?) should be excepted.
> Once the fs wrapper is in place, error recovery scenarios can be tested by
> forcing reinitialization of the fs during PE or other test cases.
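A minimal sketch of the kind of fs wrapper described in the issue above. The
class and interface names (ReinitializableFileSystem, FsOperation) are
hypothetical, not existing HBase code: the wrapper retries an operation a few
times with backoff, dropping its reference to the old FileSystem and creating a
fresh one between attempts.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ReinitializableFileSystem {

  /** One fs operation, expressed against whatever FileSystem is current. */
  public interface FsOperation<T> {
    T run(FileSystem fs) throws IOException;
  }

  private final Configuration conf;
  private final int maxRetries;
  private final long backoffMillis;
  private FileSystem fs;

  public ReinitializableFileSystem(Configuration conf, int maxRetries,
      long backoffMillis) throws IOException {
    this.conf = conf;
    this.maxRetries = maxRetries;
    this.backoffMillis = backoffMillis;
    this.fs = FileSystem.get(conf);
  }

  /** Run the operation; on IOException, reinitialize the fs and retry. */
  public synchronized <T> T execute(FsOperation<T> op) throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return op.run(fs);
      } catch (IOException e) {
        last = e;
        reinitialize(attempt);
      }
    }
    // Out of retries: treat as unrecoverable and rethrow the last failure.
    throw last;
  }

  /** Release the old fs reference, back off, and make a fresh instance. */
  private void reinitialize(int attempt) throws IOException {
    try {
      fs.close();
    } catch (IOException ignored) {
      // Best effort; the old instance may already be unusable.
    }
    try {
      Thread.sleep(backoffMillis * (attempt + 1)); // simple linear backoff
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
    // NOTE: FileSystem.get() can hand back a cached instance; a real
    // implementation would need to guarantee a genuinely new dfsclient.
    fs = FileSystem.get(conf);
  }
}
{code}

Callers like hlog, the memcache flusher, and hstore would then funnel every fs
call through execute(), which is also the point where they would have to cope
with the loss of state mentioned above. A hypothetical usage example:

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class WrapperUsageExample {
  public static void main(String[] args) throws IOException {
    ReinitializableFileSystem wrapped =
        new ReinitializableFileSystem(new HBaseConfiguration(), 3, 1000);
    final Path p = new Path("/hbase/some-file"); // illustrative path
    boolean exists = wrapped.execute(
        new ReinitializableFileSystem.FsOperation<Boolean>() {
          public Boolean run(FileSystem fs) throws IOException {
            return fs.exists(p);
          }
        });
    System.out.println(p + " exists: " + exists);
  }
}
{code}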