[ https://issues.apache.org/jira/browse/HBASE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712457#action_12712457 ]

stack commented on HBASE-1084:
------------------------------

dfs.client.max.block.acquire.failures looks like it might be useful.  We could 
double this rather than mess with the timer.
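
For illustration, doubling it could look like the short sketch below. This 
assumes the stock DFSClient default of 3 for that property, and the class 
name is made up; the same setting could just as well go in hbase-site.xml.

    import org.apache.hadoop.conf.Configuration;

    public class BlockAcquireTuning {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // DFSClient gives up on reading a block after this many failed
        // attempts to acquire it from the datanodes; stock default is 3.
        int current = conf.getInt("dfs.client.max.block.acquire.failures", 3);
        conf.setInt("dfs.client.max.block.acquire.failures", current * 2);
      }
    }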

The root problem, though, seems to be 
https://issues.apache.org/jira/browse/HADOOP-5903.  Let's figure out a patch for 
it and recommend backporting it for HBase installs on Hadoop 0.19.x and 0.20.x.

> Reinitializable DFS client
> --------------------------
>
>                 Key: HBASE-1084
>                 URL: https://issues.apache.org/jira/browse/HBASE-1084
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io, master, regionserver
>            Reporter: Andrew Purtell
>
> HBase is the only long-lived DFS client. MapReduce tasks handle DFS errors 
> by dying; HBase daemons do not, and instead depend on the DFSClient's error 
> recovery capability, which is not sufficiently developed or tested. Several 
> issues result:
> * HBASE-846: hbase looses its mind when hdfs fills
> * HBASE-879: When dfs restarts or moves blocks around, hbase regionservers 
> don't notice
> * HBASE-932: Regionserver restart
> * HBASE-1078: "java.io.IOException: Could not obtain block": allthough file 
> is there and accessible through the dfs client
> * hlog indefinitely hung on getting new blocks from dfs on apurtell cluster
> * regions closed due to transient DFS problems during loaded cluster restart
> These issues might also be related:
> * HBASE-15: Could not complete hdfs write out to flush file forcing 
> regionserver restart
> * HBASE-667: Hung regionserver; hung on hdfs: writeChunk, 
> DFSClient.java:2126, DataStreamer socketWrite
> To compensate, HBase should reinitialize the fs a few times, with backoff, 
> upon catching fs exceptions. This can be done by making a wrapper around all 
> fs operations that releases references to the old fs instance, then makes 
> and initializes a new instance to retry (a rough sketch follows this 
> description). All fs users would need to be fixed up to handle loss of 
> state around fs wrapper invocations: hlog, the memcache flusher, hstore, etc. 
> Cases of clear unrecoverable failure (are there any?) should be excepted.
> Once the fs wrapper is in place, error recovery scenarios can be tested by 
> forcing reinitialization of the fs during PE or other test cases.
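
As a rough illustration of the wrapper idea in the description above, here is 
a minimal sketch. Everything in it is hypothetical (class and method names, 
the retry budget, the backoff schedule); it is not a proposed patch, just the 
shape of the thing against the Hadoop 0.19/0.20 FileSystem API.

    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    // Funnels every fs operation through exec(), which drops the old
    // FileSystem reference and opens a fresh one on IOException, with
    // exponential backoff between attempts.
    public class ReinitializingFileSystem {
      public interface FsOp<T> {
        T run(FileSystem fs) throws IOException;
      }

      private final URI uri;
      private final Configuration conf;
      private FileSystem fs;
      private final int maxAttempts = 3;  // illustrative retry budget

      public ReinitializingFileSystem(URI uri, Configuration conf)
          throws IOException {
        this.uri = uri;
        this.conf = conf;
        this.fs = FileSystem.get(uri, conf);
      }

      public synchronized <T> T exec(FsOp<T> op) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
          try {
            return op.run(fs);
          } catch (IOException e) {
            last = e;
            if (attempt + 1 < maxAttempts) {
              reinit(attempt);
            }
          }
        }
        throw last;  // clearly unrecoverable: all attempts exhausted
      }

      // Release the old instance and make a new one. NOTE: in these Hadoop
      // versions FileSystem.get() caches instances per scheme/authority, so
      // a real patch would have to evict or bypass that cache (e.g. via
      // FileSystem.closeAll() or an uncached construction path) to get a
      // genuinely new DFSClient back.
      private void reinit(int attempt) throws IOException {
        try {
          fs.close();
        } catch (IOException ignored) {
          // old instance may already be dead; we are replacing it anyway
        }
        try {
          Thread.sleep(1000L << attempt);  // backoff: 1s, 2s, 4s, ...
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
        }
        fs = FileSystem.get(uri, conf);
      }
    }

A caller such as hlog or the memcache flusher would then route operations 
through the wrapper with an anonymous class, since these Hadoop versions 
predate lambdas (the rfs instance, the path, and imports of Path and 
FSDataInputStream are assumed):

    final Path path = new Path("/hbase/.logs/example");
    FSDataInputStream in = rfs.exec(
        new ReinitializingFileSystem.FsOp<FSDataInputStream>() {
          public FSDataInputStream run(FileSystem fs) throws IOException {
            return fs.open(path);
          }
        });

Exposing reinit() to tests would also give the forced-reinitialization hook 
the description asks for during PE runs.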

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
