Reinitializable DFS client
--------------------------

                 Key: HBASE-1084
                 URL: https://issues.apache.org/jira/browse/HBASE-1084
             Project: Hadoop HBase
          Issue Type: Improvement
          Components: io, master, regionserver
            Reporter: Andrew Purtell
             Fix For: 0.20.0


HBase is the only long-lived DFS client. Tasks handle DFS errors by dying; 
HBase daemons do not, and instead depend on the DFSClient's error recovery 
capability, but that capability is not sufficiently developed or tested. 
Several issues result:
* HBASE-846: hbase loses its mind when hdfs fills
* HBASE-879: When dfs restarts or moves blocks around, hbase regionservers 
don't notice
* HBASE-932: Regionserver restart
* HBASE-1078: "java.io.IOException: Could not obtain block": although file is 
there and accessible through the dfs client
* hlog indefinitely hung on getting new blocks from dfs on apurtell cluster
* regions closed due to transient DFS problems during loaded cluster restart

These issues might also be related:
* HBASE-15: Could not complete hdfs write out to flush file forcing 
regionserver restart
* HBASE-667: Hung regionserver; hung on hdfs: writeChunk, DFSClient.java:2126, 
DataStreamer socketWrite

HBase should reinitialize the fs a few times upon catching fs exceptions, with 
backoff, to compensate. This can be done with a wrapper around all fs 
operations that releases references to the old fs instance and creates and 
initializes a new instance for the retry. All fs users (hlog, memcache 
flusher, hstore, etc.) would need to be fixed up to handle loss of state 
across fs wrapper invocations; a sketch of such a wrapper follows.
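A minimal sketch of what the wrapper might look like, assuming the Hadoop 
FileSystem API. The class name, FsOperation callback interface, retry count, 
and backoff policy are all inventions for illustration, not existing HBase 
code:

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

/**
 * Wraps every fs operation; on IOException it drops the old FileSystem
 * reference, creates and initializes a new one, and retries with backoff.
 */
public class ReinitializingFileSystem {

  /** One fs operation, expressed against whatever FileSystem is current. */
  public interface FsOperation<T> {
    T run(FileSystem fs) throws IOException;
  }

  private final Configuration conf;
  private final int maxRetries;       // assumed retry budget
  private final long backoffMillis;   // assumed base backoff interval
  private FileSystem fs;

  public ReinitializingFileSystem(Configuration conf, int maxRetries,
      long backoffMillis) throws IOException {
    this.conf = conf;
    this.maxRetries = maxRetries;
    this.backoffMillis = backoffMillis;
    this.fs = FileSystem.get(conf);
  }

  /** Run the operation, reinitializing the fs and retrying on IOException. */
  public synchronized <T> T execute(FsOperation<T> op) throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return op.run(fs);
      } catch (IOException e) {
        last = e;
        if (attempt == maxRetries) {
          break;
        }
        reinitialize();
        try {
          // Simple linear backoff; exponential would work just as well.
          Thread.sleep(backoffMillis * (attempt + 1));
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    throw last;
  }

  /** Release the old instance and create and initialize a new one. */
  private void reinitialize() throws IOException {
    try {
      // Closing should also evict the instance from the FileSystem cache so
      // the following get() returns a fresh client; if the Hadoop version in
      // use does not do that, another way to obtain a new instance is needed.
      fs.close();
    } catch (IOException ignored) {
      // The old client may already be unusable; we are replacing it anyway.
    }
    fs = FileSystem.get(conf);
  }
}
{code}

Callers such as hlog, the memcache flusher, and hstore would route their fs 
calls through execute() and would have to be prepared to redo any partially 
completed work when the retry runs against a fresh fs instance.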

Cases of clearly unrecoverable failure (are there any?) should be excepted 
from the retry behavior.

Once the fs wrapper is in place, error recovery scenarios can be tested by 
forcing reinitialization of the fs during PerformanceEvaluation (PE) or other 
test cases.
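Along the same lines, a reinitialization can be forced without touching the 
dfs at all by handing the wrapper an operation that fails on its first 
invocation. Again only a sketch, reusing the hypothetical 
ReinitializingFileSystem above:

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ForcedReinitExample {
  public static void main(String[] args) throws IOException {
    ReinitializingFileSystem rfs =
      new ReinitializingFileSystem(new Configuration(), 3, 1000L);

    Boolean recovered = rfs.execute(
      new ReinitializingFileSystem.FsOperation<Boolean>() {
        private boolean failedOnce = false;
        public Boolean run(FileSystem fs) throws IOException {
          if (!failedOnce) {
            failedOnce = true;
            // Simulate a transient dfs fault; the wrapper should
            // reinitialize the fs and invoke us a second time.
            throw new IOException("simulated dfs failure");
          }
          return Boolean.TRUE;
        }
      });
    System.out.println("recovered: " + recovered);
  }
}
{code}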

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
