I don't think it's appropriate to die if the DFS is down. This forces the admin into an active recovery because all the regionservers went away.
I think regionservers should only die if they will never be able to continue
- DFS being down, the master being down, or ZooKeeper being down are
recoverable errors, not fatal ones.

On Sat, Feb 28, 2009 at 4:56 PM, Evgeny Ryabitskiy (JIRA)
<[email protected]> wrote:

>
>     [ https://issues.apache.org/jira/browse/HBASE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Evgeny Ryabitskiy updated HBASE-1084:
> -------------------------------------
>
>     Attachment: HBASE-1084_HRegionServer.java.patch
>
> Changed protected boolean checkFileSystem() to first try to reinitialize
> the DFS client, and to shut down only if that fails.
>
> > Reinitializable DFS client
> > --------------------------
> >
> >                 Key: HBASE-1084
> >                 URL: https://issues.apache.org/jira/browse/HBASE-1084
> >             Project: Hadoop HBase
> >          Issue Type: Improvement
> >          Components: io, master, regionserver
> >            Reporter: Andrew Purtell
> >            Assignee: Evgeny Ryabitskiy
> >             Fix For: 0.20.0
> >
> >         Attachments: HBASE-1084_HRegionServer.java.patch
> >
> >
> > HBase is the only long-lived DFS client. Tasks handle DFS errors by
> > dying; HBase daemons do not, and instead depend on the dfsclient's
> > error recovery capability, but that is not sufficiently developed or
> > tested. Several issues are a result:
> > * HBASE-846: hbase looses its mind when hdfs fills
> > * HBASE-879: When dfs restarts or moves blocks around, hbase
> >   regionservers don't notice
> > * HBASE-932: Regionserver restart
> > * HBASE-1078: "java.io.IOException: Could not obtain block": allthough
> >   file is there and accessible through the dfs client
> > * hlog indefinitely hung on getting new blocks from dfs on apurtell
> >   cluster
> > * regions closed due to transient DFS problems during loaded cluster
> >   restart
> > These issues might also be related:
> > * HBASE-15: Could not complete hdfs write out to flush file forcing
> >   regionserver restart
> > * HBASE-667: Hung regionserver; hung on hdfs: writeChunk,
> >   DFSClient.java:2126, DataStreamer socketWrite
> > HBase should reinitialize the fs a few times upon catching fs
> > exceptions, with backoff, to compensate. This can be done by making a
> > wrapper around all fs operations that releases references to the old fs
> > instance and makes and initializes a new instance to retry. All fs
> > users would need to be fixed up to handle loss of state around fs
> > wrapper invocations: hlog, memcache flusher, hstore, etc.
> > Cases of clear unrecoverable failure (are there any?) should be
> > excepted.
> > Once the fs wrapper is in place, error recovery scenarios can be tested
> > by forcing reinitialization of the fs during PE or other test cases.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
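For reference, a minimal sketch of what the patch describes: a
checkFileSystem() that probes DFS and, on failure, tries to tear down and
reinitialize the client a few times with backoff before giving up. The class
and helper names here (FileSystemChecker, probe, MAX_RETRIES) are
illustrative only, not the actual patch; the real change lives inside
HRegionServer.checkFileSystem().

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemChecker {

  private static final int MAX_RETRIES = 3;
  private static final long INITIAL_BACKOFF_MS = 1000;

  private final Configuration conf;
  private volatile FileSystem fs;

  public FileSystemChecker(Configuration conf) throws IOException {
    this.conf = conf;
    this.fs = FileSystem.get(conf);
  }

  /**
   * Returns true if the filesystem is usable, possibly after
   * reinitializing it. Returns false only when every retry failed,
   * in which case the caller should initiate shutdown.
   */
  protected boolean checkFileSystem() {
    if (probe(fs)) {
      return true;
    }
    long backoff = INITIAL_BACKOFF_MS;
    for (int i = 0; i < MAX_RETRIES; i++) {
      try {
        // Close the broken client to release its state, then create a
        // fresh instance against the same configuration. close() also
        // evicts the instance from the FileSystem cache, so get() below
        // really does hand back a new client.
        try {
          fs.close();
        } catch (IOException ignored) {
          // Old instance is already broken; nothing more to do.
        }
        fs = FileSystem.get(conf);
        if (probe(fs)) {
          return true;
        }
      } catch (IOException e) {
        // Reinitialization itself failed; back off and retry.
      }
      try {
        Thread.sleep(backoff);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return false;
      }
      backoff *= 2; // exponential backoff between attempts
    }
    return false; // caller shuts the regionserver down
  }

  /** Cheap liveness probe: can we still talk to the namenode? */
  private boolean probe(FileSystem fs) {
    try {
      fs.exists(new Path("/"));
      return true;
    } catch (IOException e) {
      return false;
    }
  }
}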

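The broader fix the issue description asks for is a wrapper around all fs
operations. A sketch of that idea, assuming hypothetical names
(RetryingFileSystem, FsOperation) that are not HBase or Hadoop API: on
IOException the wrapper releases the old FileSystem instance, initializes a
new one, and retries with backoff.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class RetryingFileSystem {

  /** One filesystem operation, run against the current instance. */
  public interface FsOperation<T> {
    T run(FileSystem fs) throws IOException;
  }

  private final Configuration conf;
  private final int maxRetries;
  private FileSystem fs;

  public RetryingFileSystem(Configuration conf, int maxRetries)
      throws IOException {
    this.conf = conf;
    this.maxRetries = maxRetries;
    this.fs = FileSystem.get(conf);
  }

  /** Runs op, reinitializing the fs and retrying on IOException. */
  public synchronized <T> T execute(FsOperation<T> op) throws IOException {
    long backoff = 1000; // ms, doubled after each failed attempt
    IOException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return op.run(fs);
      } catch (IOException e) {
        last = e;
        reinitialize();
      }
      if (attempt < maxRetries) {
        try {
          Thread.sleep(backoff);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          break;
        }
        backoff *= 2;
      }
    }
    throw last;
  }

  /** Drops the old (possibly broken) instance and creates a fresh one. */
  private void reinitialize() {
    try {
      fs.close();
    } catch (IOException ignored) {
      // Old client is already broken; nothing useful to do.
    }
    try {
      fs = FileSystem.get(conf);
    } catch (IOException e) {
      // Keep the stale reference; the next attempt will fail fast and
      // we will try to reinitialize again after backing off.
    }
  }
}

Call sites would then look like retryingFs.execute(fs -> fs.exists(path)),
and, as the issue notes, every fs user (hlog, memcache flusher, hstore, etc.)
still has to handle whatever state it lost across the reinitialization.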