OK, so that region server must have been holding .META., which means you will have to restart HBase.

Sorry,

J-D

On Tue, Feb 24, 2009 at 11:27 AM, Michael Dagaev <[email protected]> wrote:

> Sorry, I mean that some requests fail when a region server is down in
> HBase 0.18.1, which we are using now.
>
> Besides, when I started the stopped region server and stopped another one,
> not only were "old" requests stuck because of retries, but new requests
> (e.g. issued by hbase shell) failed too.
>
> The master.jsp also fails with
>
> Trying to contact region server <...>:60020 for region .META.,,1, row
> '', but failed after 10 attempts.
> Exceptions: java.io.IOException: Call failed on local exception
>
> Thank you for your cooperation,
> M.
>
> On Tue, Feb 24, 2009 at 6:06 PM, Jean-Daniel Cryans <[email protected]> wrote:
> > As I wrote, you should upgrade to the 0.18 branch in SVN.
> >
> > J-D
> >
> > On Tue, Feb 24, 2009 at 11:04 AM, Michael Dagaev <[email protected]> wrote:
> >
> >> I do not know if it was holding the ROOT or META region.
> >> It looks like requests may fail in HBase 0.18 if a region server stops.
> >>
> >> Thanks,
> >> M.
> >>
> >> On Tue, Feb 24, 2009 at 5:40 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> > Well, this should not happen like that. Was the region server holding the
> >> > ROOT or META region? If so, that's a bug corrected in 0.19.0 and
> >> > branch-0.18. I suggest you upgrade to that version if you don't want to
> >> > break your MR jobs.
> >> >
> >> > J-D
> >> >
> >> > On Tue, Feb 24, 2009 at 10:33 AM, Michael Dagaev <[email protected]> wrote:
> >> >
> >> >> What I see now is that the client gets an exception (see below) once a
> >> >> region server stops:
> >> >>
> >> >> org.apache.hadoop.hbase.client.NoServerForRegionException: No server
> >> >> address listed in .META.
> >> >> ...
> >> >> Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException:
> >> >> Trying to contact region server <region server>:60020 for region
> >> >>
> >> >> I guess the exception occurred because the region server is down. Is it
> >> >> correct?
> >> >>
> >> >> Thank you for your cooperation,
> >> >> M.
> >> >>
> >> >> P.S. We are running version 0.18.1
> >> >>
> >> >> On Tue, Feb 24, 2009 at 5:07 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >> > Correcting myself: there is no waiting time as regards figuring out that
> >> >> > the node is dead. It will still have to fetch the region location in META.
> >> >> >
> >> >> > J-D
> >> >> >
> >> >> > On Tue, Feb 24, 2009 at 10:02 AM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >> >
> >> >> >> Well, if a region server dies instead of being cleanly shut down, it
> >> >> >> takes in the worst case 180 seconds (a region server lease length)
> >> >> >> before the Master reassigns the regions. Clients trying to connect to
> >> >> >> that server will take IIRC 10 seconds to figure out the node is down,
> >> >> >> then the time to communicate with ROOT and META is under 1 sec. If META
> >> >> >> wasn't updated yet, it will retry all of that.
> >> >> >>
> >> >> >> In the next release (0.20.0), the master is notified by ZooKeeper within
> >> >> >> seconds of a region server death and will proceed to reassign the
> >> >> >> regions immediately.
> >> >> >>
> >> >> >> If the client doesn't have the region in cache and META is updated with
> >> >> >> the region server death, there will be no waiting time.
> >> >> >>
> >> >> >> J-D
> >> >> >>
> >> >> >> On Tue, Feb 24, 2009 at 9:49 AM, Michael Dagaev <[email protected]> wrote:
> >> >> >>
> >> >> >>> Thanks, now it is clear.
> >> >> >>>
> >> >> >>> However, if a region server is down, it takes a lot of time to retry
> >> >> >>> first, then to rescan the META region when the retries fail, rescan
> >> >> >>> ROOT, etc. to eventually get to another region server, which will
> >> >> >>> handle the request.
> >> >> >>> Is it correct?
> >> >> >>>
> >> >> >>> On Tue, Feb 24, 2009 at 4:36 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >> >>> > This is why we have a META table: it holds the location info. See
> >> >> >>> > http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture#client
> >> >> >>> >
> >> >> >>> > J-D
> >> >> >>> >
> >> >> >>> > On Tue, Feb 24, 2009 at 9:28 AM, Michael Dagaev <[email protected]> wrote:
> >> >> >>> >
> >> >> >>> >> Thanks, Jean-Daniel.
> >> >> >>> >>
> >> >> >>> >> I did run hbase-daemon stop regionserver and start regionserver
> >> >> >>> >> and saw the client retrying to connect to the restarted region
> >> >> >>> >> server.
> >> >> >>> >>
> >> >> >>> >> How does it know to connect to another region server? Maybe it
> >> >> >>> >> stops retrying, asks the master, and gets another region server to
> >> >> >>> >> connect to.
> >> >> >>> >> Is it correct?
> >> >> >>> >>
> >> >> >>> >> Thank you for your cooperation,
> >> >> >>> >> M.
> >> >> >>> >>
> >> >> >>> >> On Tue, Feb 24, 2009 at 3:56 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >> >>> >> > Michael,
> >> >> >>> >> >
> >> >> >>> >> > Regarding stopping those nodes, use hadoop-daemon/hbase-daemon to
> >> >> >>> >> > stop them cleanly. Requests from the clients will not "fail";
> >> >> >>> >> > they will simply be told to look elsewhere for the regions they
> >> >> >>> >> > have in cache. Unless you only have 1 region server...
> >> >> >>> >> >
> >> >> >>> >> > Regarding starting the nodes, apart from the usual
> >> >> >>> >> > hadoop-daemon/hbase-daemon, no.
> >> >> >>> >> >
> >> >> >>> >> > J-D
> >> >> >>> >> >
> >> >> >>> >> > On Tue, Feb 24, 2009 at 8:50 AM, Michael Dagaev <[email protected]> wrote:
> >> >> >>> >> >
> >> >> >>> >> >> Hi, all
> >> >> >>> >> >>
> >> >> >>> >> >> As I understand, I can stop a region server and a data node in
> >> >> >>> >> >> a cluster "semi-transparently" for clients, i.e. the requests
> >> >> >>> >> >> handled by the region server at that time will fail, but the
> >> >> >>> >> >> cluster will be working.
> >> >> >>> >> >>
> >> >> >>> >> >> If I start the data node and region server I do not have to do
> >> >> >>> >> >> anything to make them work.
> >> >> >>> >> >>
> >> >> >>> >> >> Is it correct?
> >> >> >>> >> >>
> >> >> >>> >> >> Thank you for your cooperation,
> >> >> >>> >> >> M.
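
The worst-case timing described in the thread for an unclean region server death can be put together as a rough estimate. The sketch below is only an illustration, not how the HBase client is implemented. The defaults are the figures quoted in the thread (180 s lease, about 10 s to notice the dead server, under 1 s to consult ROOT and META). The function name and the loop model are assumptions: attempts run back to back, with no pauses and no retry cap, and META is taken to be updated exactly when the lease expires.

```python
def worst_case_wait(lease_s=180.0, detect_s=10.0, lookup_s=1.0):
    """Estimate how long a client is stuck after an unclean region server death.

    Hypothetical model: the server dies at t=0, the master reassigns its
    regions (and META is updated) at t=lease_s, and the client keeps cycling
    through "detect dead server, then re-read ROOT/META" until the lookup
    returns the new location.

    Returns (elapsed_seconds, attempts).
    """
    elapsed = 0.0
    attempts = 0
    while True:
        attempts += 1
        elapsed += detect_s   # time to figure out the cached server is down
        elapsed += lookup_s   # time to communicate with ROOT and META
        if elapsed >= lease_s:
            # META already lists the new server, so this lookup succeeds.
            return elapsed, attempts


if __name__ == "__main__":
    seconds, attempts = worst_case_wait()
    print(seconds, attempts)  # → 187.0 17
```

With the thread's numbers this model needs 17 attempts, while the error quoted in the thread shows the client giving up after 10 attempts. Under these assumptions that is at least consistent with requests failing, not merely stalling, when a server dies uncleanly on 0.18.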
