Re: Question on region server/data node restart

Michael Dagaev Tue, 24 Feb 2009 08:33:45 -0800

No problem :)


On Tue, Feb 24, 2009 at 6:30 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> Ok so that region server must have been holding .META., you will have to
> restart HBase.
>
> Sorry
>
> J-D
>
> On Tue, Feb 24, 2009 at 11:27 AM, Michael Dagaev
> <michael.dag...@gmail.com>wrote:
>
>> Sorry, I mean that some requests fail when a region server is down in
>> HBase 0.18.1, which we are using now.
>>
>> Besides, when I started the stopped region server and stopped another one,
>> not only were "old" requests stuck because of retries, but new requests
>> (e.g. issued by the hbase shell) failed too.
>>
>> The master.jsp also fails with
>>
>> Trying to contact region server <...>:60020 for region .META.,,1, row
>> '', but failed after 10 attempts.
>> Exceptions: java.io.IOException: Call failed on local exception
>>
>> Thank you for your cooperation,
>> M.
>>
>> On Tue, Feb 24, 2009 at 6:06 PM, Jean-Daniel Cryans <jdcry...@apache.org>
>> wrote:
>> > As I wrote, you should upgrade to 0.18 branch in SVN.
>> >
>> > J-D
>> >
>> > On Tue, Feb 24, 2009 at 11:04 AM, Michael Dagaev
>> > <michael.dag...@gmail.com>wrote:
>> >
>> >> I do not know if it was holding the ROOT or META region.
>> >> It looks like requests may fail in HBase 0.18 if a region server stops.
>> >>
>> >> Thanks,
>> >> M.
>> >>
>> >> On Tue, Feb 24, 2009 at 5:40 PM, Jean-Daniel Cryans <
>> jdcry...@apache.org>
>> >> wrote:
>> >> > Well this should not happen like that. Was the region server holding the
>> >> > ROOT or META region? If so, well that's a bug corrected in 0.19.0 and
>> >> > branch-0.18. I suggest you upgrade to that version if you don't want to
>> >> > break your MR jobs.
>> >> >
>> >> > J-D
>> >> >
>> >> > On Tue, Feb 24, 2009 at 10:33 AM, Michael Dagaev
>> >> > <michael.dag...@gmail.com>wrote:
>> >> >
>> >> >> What I see now is that the client gets an exception (see below) once a
>> >> >> region server stops:
>> >> >>
>> >> >> org.apache.hadoop.hbase.client.NoServerForRegionException: No server
>> >> >> address listed in .META.
>> >> >> ...
>> >> >> Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException:
>> >> >> Trying to contact region server <region server>:60020 for region
>> >> >>
>> >> >> I guess the exception occurred since the region server is down. Is it
>> >> >> correct?
>> >> >>
>> >> >> Thank you for your cooperation,
>> >> >> M.
>> >> >>
>> >> >> P. S. We are running version 0.18.1
>> >> >>
>> >> >> On Tue, Feb 24, 2009 at 5:07 PM, Jean-Daniel Cryans <
>> >> jdcry...@apache.org>
>> >> >> wrote:
>> >> >> > Correcting myself: there is no waiting time with regard to figuring out
>> >> >> > that the node is dead. The client will still have to fetch the region
>> >> >> > location in META.
>> >> >> >
>> >> >> > J-D
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Feb 24, 2009 at 10:02 AM, Jean-Daniel Cryans <
>> >> >> jdcry...@apache.org>wrote:
>> >> >> >
>> >> >> >> Well, if a region server dies instead of being cleanly shut down, it
>> >> >> >> takes in the worst case 180 seconds (a region server lease length)
>> >> >> >> before the Master reassigns the regions. Clients trying to connect to
>> >> >> >> that server will take IIRC 10 seconds to figure out the node is down;
>> >> >> >> then the time to communicate with ROOT and META is under 1 sec. If
>> >> >> >> META wasn't updated yet, the client will retry all of that.
>> >> >> >>
>> >> >> >> In the next release (0.20.0), the master is notified by ZooKeeper
>> >> >> >> within seconds of a region server's death and will proceed to
>> >> >> >> reassign the regions immediately.
>> >> >> >>
>> >> >> >> If the client doesn't have the region in cache and META is already
>> >> >> >> updated to reflect the region server death, there will be no waiting
>> >> >> >> time.
>> >> >> >>
>> >> >> >> J-D
>> >> >> >>
>> >> >> >>
>> >> >> >> On Tue, Feb 24, 2009 at 9:49 AM, Michael Dagaev <
>> >> >> michael.dag...@gmail.com>wrote:
>> >> >> >>
>> >> >> >>> Thanks, now it is clear.
>> >> >> >>>
>> >> >> >>> However, if a region server is down, it takes a lot of time to retry
>> >> >> >>> first, then to rescan the META region when the retries fail, rescan
>> >> >> >>> ROOT, etc., to eventually get to another region server, which will
>> >> >> >>> handle the request.
>> >> >> >>> Is that correct?
>> >> >> >>>
>> >> >> >>> On Tue, Feb 24, 2009 at 4:36 PM, Jean-Daniel Cryans <
>> >> >> jdcry...@apache.org>
>> >> >> >>> wrote:
>> >> >> >>> > This is why we have a META table; it holds the location info. See
>> >> >> >>> > http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture#client
>> >> >> >>> >
>> >> >> >>> > J-D
>> >> >> >>> >
>> >> >> >>> > On Tue, Feb 24, 2009 at 9:28 AM, Michael Dagaev <
>> >> >> >>> michael.dag...@gmail.com>wrote:
>> >> >> >>> >
>> >> >> >>> >> Thanks, Jean-Daniel.
>> >> >> >>> >>
>> >> >> >>> >> I did run hbase-daemon stop regionserver and start regionserver
>> >> >> >>> >> and saw the client retrying to connect to the restarted region server.
>> >> >> >>> >>
>> >> >> >>> >> How does it know to connect to another region server? Maybe it stops
>> >> >> >>> >> retrying, asks the master, and gets another region server to connect to.
>> >> >> >>> >> Is that correct?
>> >> >> >>> >>
>> >> >> >>> >> Thank you for your cooperation,
>> >> >> >>> >> M.
>> >> >> >>> >>
>> >> >> >>> >> On Tue, Feb 24, 2009 at 3:56 PM, Jean-Daniel Cryans <
>> >> >> >>> jdcry...@apache.org>
>> >> >> >>> >> wrote:
>> >> >> >>> >> > Michael,
>> >> >> >>> >> >
>> >> >> >>> >> > Regarding stopping those nodes, do it using hadoop-daemon/hbase-daemon
>> >> >> >>> >> > to stop them cleanly. Requests from the clients will not "fail"; they
>> >> >> >>> >> > will simply be told to look elsewhere for the regions they have in
>> >> >> >>> >> > cache. Unless you only have 1 region server...
>> >> >> >>> >> >
>> >> >> >>> >> > Regarding starting the nodes, apart from the usual
>> >> >> >>> >> > hadoop-daemon/hbase-daemon, no, nothing more is needed.
>> >> >> >>> >> >
>> >> >> >>> >> > J-D
>> >> >> >>> >> >
>> >> >> >>> >> > On Tue, Feb 24, 2009 at 8:50 AM, Michael Dagaev <
>> >> >> >>> >> michael.dag...@gmail.com>wrote:
>> >> >> >>> >> >
>> >> >> >>> >> >> Hi, all
>> >> >> >>> >> >>
>> >> >> >>> >> >>     As I understand, I can stop a region server and a data node in a
>> >> >> >>> >> >> cluster "semi-transparently" for clients, i.e. the requests handled by
>> >> >> >>> >> >> the region server at that time will fail, but the cluster will keep
>> >> >> >>> >> >> working.
>> >> >> >>> >> >>
>> >> >> >>> >> >> If I start the data node and region server, I do not have to do
>> >> >> >>> >> >> anything to make them work.
>> >> >> >>> >> >>
>> >> >> >>> >> >> Is it correct ?
>> >> >> >>> >> >>
>> >> >> >>> >> >> Thank you for your cooperation,
>> >> >> >>> >> >> M.
>> >> >> >>> >> >>
>> >> >> >>> >> >
>> >> >> >>> >>
>> >> >> >>> >
>> >> >> >>>
>> >> >> >>
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >
>>
>
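
For readers following the timing discussion in the thread, here is a rough back-of-the-envelope sketch in Python of the stale-cache case J-D describes: a region server dies uncleanly, the master may need up to 180 seconds to reassign its regions, and a client spends about 10 seconds noticing the server is down plus under 1 second consulting ROOT and META on each retry. The function name, its parameters, and the optional attempt cap are hypothetical, and the model ignores any pause between retries, so treat it as an illustration of the arithmetic rather than a description of the HBase client code.

```python
def time_to_recover(reassign_after_s=180.0, detect_s=10.0, lookup_s=1.0,
                    max_attempts=None):
    """Estimate how long a client with a stale cached location keeps retrying.

    Hypothetical model: each attempt costs detect_s (noticing the cached
    server is down) plus lookup_s (re-reading ROOT and META). The attempt
    succeeds only if META has been updated by then, i.e. once at least
    reassign_after_s seconds have passed since the server died.

    Returns (elapsed_seconds, attempts); elapsed_seconds is None if the
    client gives up after max_attempts unsuccessful attempts.
    """
    elapsed = 0.0
    attempts = 0
    while True:
        attempts += 1
        elapsed += detect_s + lookup_s
        if elapsed >= reassign_after_s:
            # META now points at the new region server.
            return elapsed, attempts
        if max_attempts is not None and attempts >= max_attempts:
            # Retries exhausted before the regions were reassigned.
            return None, attempts


if __name__ == "__main__":
    print(time_to_recover())                 # unlimited retries
    print(time_to_recover(max_attempts=10))  # capped retries
```

Under these assumed numbers, a client that retries without limit gets through only after the lease has expired, while a client capped at 10 attempts gives up first, which is one way a request can fail even though the cluster itself recovers.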
