Ryan, if the DFS is down, what should we do?
Once we have ZooKeeper doing a lot of the current master duties, having the master go down is not a cluster-killing event. If we lose ZooKeeper quorum, how do we recover?

---
Jim Kellerman, Powerset (Live Search, Microsoft Corporation)

> -----Original Message-----
> From: Ryan Rawson [mailto:[email protected]]
> Sent: Saturday, February 28, 2009 5:12 PM
> To: [email protected]
> Subject: Re: [jira] Updated: (HBASE-1084) Reinitializable DFS client
>
> I don't think it's appropriate to die if the DFS is down. This forces
> the admin into an active recovery because all the regionservers went
> away.
>
> I think regionservers should only die if they will never be able to
> continue on in the future. DFS being down, master being down, or
> ZooKeeper being down are not unrecoverable errors.
>
> On Sat, Feb 28, 2009 at 4:56 PM, Evgeny Ryabitskiy (JIRA)
> <[email protected]> wrote:
>
> >      [ https://issues.apache.org/jira/browse/HBASE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >
> > Evgeny Ryabitskiy updated HBASE-1084:
> > -------------------------------------
> >
> >     Attachment: HBASE-1084_HRegionServer.java.patch
> >
> > Changed protected boolean checkFileSystem() to try to reinitialize
> > the DFS first and, if that fails, shut down.
> >
> > > Reinitializable DFS client
> > > --------------------------
> > >
> > >                 Key: HBASE-1084
> > >                 URL: https://issues.apache.org/jira/browse/HBASE-1084
> > >             Project: Hadoop HBase
> > >          Issue Type: Improvement
> > >          Components: io, master, regionserver
> > >            Reporter: Andrew Purtell
> > >            Assignee: Evgeny Ryabitskiy
> > >             Fix For: 0.20.0
> > >
> > >         Attachments: HBASE-1084_HRegionServer.java.patch
> > >
> > > HBase is the only long-lived DFS client. Tasks handle DFS errors by
> > > dying. HBase daemons do not, and instead depend on the dfsclient's
> > > error recovery capability, but that is not sufficiently developed
> > > or tested. Several issues are a result:
> > > * HBASE-846: hbase loses its mind when hdfs fills
> > > * HBASE-879: When dfs restarts or moves blocks around, hbase
> > >   regionservers don't notice
> > > * HBASE-932: Regionserver restart
> > > * HBASE-1078: "java.io.IOException: Could not obtain block":
> > >   although file is there and accessible through the dfs client
> > > * hlog indefinitely hung on getting new blocks from dfs on apurtell
> > >   cluster
> > > * regions closed due to transient DFS problems during loaded
> > >   cluster restart
> > > These issues might also be related:
> > > * HBASE-15: Could not complete hdfs write out to flush file,
> > >   forcing regionserver restart
> > > * HBASE-667: Hung regionserver; hung on hdfs: writeChunk,
> > >   DFSClient.java:2126, DataStreamer socketWrite
> > > HBase should reinitialize the fs a few times upon catching fs
> > > exceptions, with backoff, to compensate. This can be done by making
> > > a wrapper around all fs operations that releases references to the
> > > old fs instance and makes and initializes a new instance to retry.
> > > All fs users would need to be fixed up to handle loss of state
> > > around fs wrapper invocations: hlog, memcache flusher, hstore, etc.
> > > Cases of clear unrecoverable failure (are there any?) should be
> > > excepted.
> > > Once the fs wrapper is in place, error recovery scenarios can be
> > > tested by forcing reinitialization of the fs during PE or other
> > > test cases.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
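
For reference, here is a minimal sketch of the reinitialize-with-backoff idea from the issue description, in the spirit of checkFileSystem(). The class name, probe operation, retry count, and backoff schedule are illustrative assumptions, not the attached HBASE-1084_HRegionServer.java.patch; only FileSystem.get(), exists(), and close() are actual Hadoop API.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch, not the attached patch: retry a cheap fs probe,
// rebuilding the DFS client with backoff before giving up.
public class ReinitializableFs {
  private final Configuration conf;
  private volatile FileSystem fs;

  public ReinitializableFs(Configuration conf) throws IOException {
    this.conf = conf;
    this.fs = FileSystem.get(conf);
  }

  // Returns true once the fs answers a probe; false only after all
  // retries fail, leaving the shutdown decision to the caller.
  public boolean checkFileSystem(Path probe, int maxRetries) {
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        fs.exists(probe);                  // any cheap fs call works as a probe
        return true;
      } catch (IOException probeFailed) {
        try {
          // close() also evicts the instance from FileSystem's cache,
          // so the next get() really builds a fresh client.
          fs.close();
        } catch (IOException ignored) {
          // the old instance is unusable anyway
        }
        try {
          Thread.sleep(1000L << attempt);  // simple exponential backoff
          fs = FileSystem.get(conf);       // make and init a new instance
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return false;
        } catch (IOException reinitFailed) {
          // DFS still unreachable; loop and back off again
        }
      }
    }
    return false;
  }
}

Whether a false return should mean shutdown or just more waiting is exactly the question in the thread above: the sketch deliberately pushes that decision up to the caller.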
