Ryan, if the DFS is down, what should we do?
Once we have ZooKeeper doing a lot of the current master duties, having the master go down is not a cluster-killing event. If we lose ZooKeeper quorum, how do we recover?

---
Jim Kellerman, Powerset (Live Search, Microsoft Corporation)

> -----Original Message-----
> From: Ryan Rawson [mailto:[email protected]]
> Sent: Saturday, February 28, 2009 5:12 PM
> To: [email protected]
> Subject: Re: [jira] Updated: (HBASE-1084) Reinitializable DFS client
>
> I don't think it's appropriate to die if the DFS is down. This forces
> the admin into an active recovery because all the regionservers went
> away.
>
> I think regionservers should only die if they will never be able to
> continue on in the future. DFS being down, master being down, or
> ZooKeeper being down are not unrecoverable errors.
>
> On Sat, Feb 28, 2009 at 4:56 PM, Evgeny Ryabitskiy (JIRA)
> <[email protected]> wrote:
>
> >      [ https://issues.apache.org/jira/browse/HBASE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >
> > Evgeny Ryabitskiy updated HBASE-1084:
> > -------------------------------------
> >
> >     Attachment: HBASE-1084_HRegionServer.java.patch
> >
> > Changed protected boolean checkFileSystem() to try to reinitialize
> > the DFS first and, if that fails, shut down.
> >
> > > Reinitializable DFS client
> > > --------------------------
> > >
> > >                 Key: HBASE-1084
> > >                 URL: https://issues.apache.org/jira/browse/HBASE-1084
> > >             Project: Hadoop HBase
> > >          Issue Type: Improvement
> > >          Components: io, master, regionserver
> > >            Reporter: Andrew Purtell
> > >            Assignee: Evgeny Ryabitskiy
> > >             Fix For: 0.20.0
> > >
> > >         Attachments: HBASE-1084_HRegionServer.java.patch
> > >
> > > HBase is the only long-lived DFS client. Tasks handle DFS errors by
> > > dying. HBase daemons do not, and instead depend on the dfsclient's
> > > error recovery capability, but that is not sufficiently developed
> > > or tested. Several issues are a result:
> > > * HBASE-846: hbase loses its mind when hdfs fills
> > > * HBASE-879: When dfs restarts or moves blocks around, hbase
> > >   regionservers don't notice
> > > * HBASE-932: Regionserver restart
> > > * HBASE-1078: "java.io.IOException: Could not obtain block":
> > >   although file is there and accessible through the dfs client
> > > * hlog indefinitely hung on getting new blocks from dfs on apurtell
> > >   cluster
> > > * regions closed due to transient DFS problems during loaded
> > >   cluster restart
> > > These issues might also be related:
> > > * HBASE-15: Could not complete hdfs write out to flush file,
> > >   forcing regionserver restart
> > > * HBASE-667: Hung regionserver; hung on hdfs: writeChunk,
> > >   DFSClient.java:2126, DataStreamer socketWrite
> > > HBase should reinitialize the fs a few times upon catching fs
> > > exceptions, with backoff, to compensate. This can be done by making
> > > a wrapper around all fs operations that releases references to the
> > > old fs instance and makes and initializes a new instance to retry.
> > > All fs users would need to be fixed up to handle loss of state
> > > around fs wrapper invocations: hlog, memcache flusher, hstore, etc.
> > > Cases of clear unrecoverable failure (are there any?) should be
> > > excepted.
> > > Once the fs wrapper is in place, error recovery scenarios can be
> > > tested by forcing reinitialization of the fs during PE or other
> > > test cases.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
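
For reference, here is a minimal sketch of the reinitialize-with-backoff idea from the issue description, in the spirit of checkFileSystem(). The class name, probe operation, retry count, and backoff schedule are illustrative assumptions, not the attached HBASE-1084_HRegionServer.java.patch; only FileSystem.get(), exists(), and close() are actual Hadoop API.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch, not the attached patch: retry a cheap fs probe,
// rebuilding the DFS client with backoff before giving up.
public class ReinitializableFs {
  private final Configuration conf;
  private volatile FileSystem fs;

  public ReinitializableFs(Configuration conf) throws IOException {
    this.conf = conf;
    this.fs = FileSystem.get(conf);
  }

  // Returns true once the fs answers a probe; false only after all
  // retries fail, leaving the shutdown decision to the caller.
  public boolean checkFileSystem(Path probe, int maxRetries) {
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        fs.exists(probe);                  // any cheap fs call works as a probe
        return true;
      } catch (IOException probeFailed) {
        try {
          // close() also evicts the instance from FileSystem's cache,
          // so the next get() really builds a fresh client.
          fs.close();
        } catch (IOException ignored) {
          // the old instance is unusable anyway
        }
        try {
          Thread.sleep(1000L << attempt);  // simple exponential backoff
          fs = FileSystem.get(conf);       // make and init a new instance
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return false;
        } catch (IOException reinitFailed) {
          // DFS still unreachable; loop and back off again
        }
      }
    }
    return false;
  }
}

Whether a false return should mean shutdown or just more waiting is exactly the question in the thread above: the sketch deliberately pushes that decision up to the caller.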
