I don't think it's appropriate to die if the DFS is down. This forces the admin into an active recovery because all the regionservers went away.
I think regionservers should only die if they will never be able to continue
- DFS being down, the master being down, or ZooKeeper being down are
recoverable errors, not fatal ones.

On Sat, Feb 28, 2009 at 4:56 PM, Evgeny Ryabitskiy (JIRA)
<[email protected]> wrote:

>
>     [ https://issues.apache.org/jira/browse/HBASE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Evgeny Ryabitskiy updated HBASE-1084:
> -------------------------------------
>
>     Attachment: HBASE-1084_HRegionServer.java.patch
>
> Changed protected boolean checkFileSystem() to first try to reinitialize
> the DFS client, and to shut down only if that fails.
>
> > Reinitializable DFS client
> > --------------------------
> >
> >                 Key: HBASE-1084
> >                 URL: https://issues.apache.org/jira/browse/HBASE-1084
> >             Project: Hadoop HBase
> >          Issue Type: Improvement
> >          Components: io, master, regionserver
> >            Reporter: Andrew Purtell
> >            Assignee: Evgeny Ryabitskiy
> >             Fix For: 0.20.0
> >
> >         Attachments: HBASE-1084_HRegionServer.java.patch
> >
> >
> > HBase is the only long-lived DFS client. Tasks handle DFS errors by
> > dying; HBase daemons do not, and instead depend on the dfsclient's
> > error recovery capability, but that is not sufficiently developed or
> > tested. Several issues are a result:
> > * HBASE-846: hbase looses its mind when hdfs fills
> > * HBASE-879: When dfs restarts or moves blocks around, hbase
> >   regionservers don't notice
> > * HBASE-932: Regionserver restart
> > * HBASE-1078: "java.io.IOException: Could not obtain block": allthough
> >   file is there and accessible through the dfs client
> > * hlog indefinitely hung on getting new blocks from dfs on apurtell
> >   cluster
> > * regions closed due to transient DFS problems during loaded cluster
> >   restart
> > These issues might also be related:
> > * HBASE-15: Could not complete hdfs write out to flush file forcing
> >   regionserver restart
> > * HBASE-667: Hung regionserver; hung on hdfs: writeChunk,
> >   DFSClient.java:2126, DataStreamer socketWrite
> > HBase should reinitialize the fs a few times upon catching fs
> > exceptions, with backoff, to compensate. This can be done by making a
> > wrapper around all fs operations that releases references to the old fs
> > instance and makes and initializes a new instance to retry. All fs
> > users would need to be fixed up to handle loss of state around fs
> > wrapper invocations: hlog, memcache flusher, hstore, etc.
> > Cases of clear unrecoverable failure (are there any?) should be
> > excepted.
> > Once the fs wrapper is in place, error recovery scenarios can be tested
> > by forcing reinitialization of the fs during PE or other test cases.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
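For reference, a minimal sketch of what the patch describes: a
checkFileSystem() that probes DFS and, on failure, tries to tear down and
reinitialize the client a few times with backoff before giving up. The class
and helper names here (FileSystemChecker, probe, MAX_RETRIES) are
illustrative only, not the actual patch; the real change lives inside
HRegionServer.checkFileSystem().

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemChecker {

  private static final int MAX_RETRIES = 3;
  private static final long INITIAL_BACKOFF_MS = 1000;

  private final Configuration conf;
  private volatile FileSystem fs;

  public FileSystemChecker(Configuration conf) throws IOException {
    this.conf = conf;
    this.fs = FileSystem.get(conf);
  }

  /**
   * Returns true if the filesystem is usable, possibly after
   * reinitializing it. Returns false only when every retry failed,
   * in which case the caller should initiate shutdown.
   */
  protected boolean checkFileSystem() {
    if (probe(fs)) {
      return true;
    }
    long backoff = INITIAL_BACKOFF_MS;
    for (int i = 0; i < MAX_RETRIES; i++) {
      try {
        // Close the broken client to release its state, then create a
        // fresh instance against the same configuration. close() also
        // evicts the instance from the FileSystem cache, so get() below
        // really does hand back a new client.
        try {
          fs.close();
        } catch (IOException ignored) {
          // Old instance is already broken; nothing more to do.
        }
        fs = FileSystem.get(conf);
        if (probe(fs)) {
          return true;
        }
      } catch (IOException e) {
        // Reinitialization itself failed; back off and retry.
      }
      try {
        Thread.sleep(backoff);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return false;
      }
      backoff *= 2; // exponential backoff between attempts
    }
    return false; // caller shuts the regionserver down
  }

  /** Cheap liveness probe: can we still talk to the namenode? */
  private boolean probe(FileSystem fs) {
    try {
      fs.exists(new Path("/"));
      return true;
    } catch (IOException e) {
      return false;
    }
  }
}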

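The broader fix the issue description asks for is a wrapper around all fs
operations. A sketch of that idea, assuming hypothetical names
(RetryingFileSystem, FsOperation) that are not HBase or Hadoop API: on
IOException the wrapper releases the old FileSystem instance, initializes a
new one, and retries with backoff.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class RetryingFileSystem {

  /** One filesystem operation, run against the current instance. */
  public interface FsOperation<T> {
    T run(FileSystem fs) throws IOException;
  }

  private final Configuration conf;
  private final int maxRetries;
  private FileSystem fs;

  public RetryingFileSystem(Configuration conf, int maxRetries)
      throws IOException {
    this.conf = conf;
    this.maxRetries = maxRetries;
    this.fs = FileSystem.get(conf);
  }

  /** Runs op, reinitializing the fs and retrying on IOException. */
  public synchronized <T> T execute(FsOperation<T> op) throws IOException {
    long backoff = 1000; // ms, doubled after each failed attempt
    IOException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return op.run(fs);
      } catch (IOException e) {
        last = e;
        reinitialize();
      }
      if (attempt < maxRetries) {
        try {
          Thread.sleep(backoff);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          break;
        }
        backoff *= 2;
      }
    }
    throw last;
  }

  /** Drops the old (possibly broken) instance and creates a fresh one. */
  private void reinitialize() {
    try {
      fs.close();
    } catch (IOException ignored) {
      // Old client is already broken; nothing useful to do.
    }
    try {
      fs = FileSystem.get(conf);
    } catch (IOException e) {
      // Keep the stale reference; the next attempt will fail fast and
      // we will try to reinitialize again after backing off.
    }
  }
}

Call sites would then look like retryingFs.execute(fs -> fs.exists(path)),
and, as the issue notes, every fs user (hlog, memcache flusher, hstore, etc.)
still has to handle whatever state it lost across the reinitialization.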