Yes.

Currently there are two heartbeats: the zk client one, and then the
hbase one, which used to be what we relied on for figuring out whether
a regionserver is alive but is now just used to post the master the
regionserver stats such as requests per second.  This latter is going
away in 0.92 (pre-0.90.0, regionserver and master would swap 'messages'
on the back of the heartbeat -- i.e. open this region, I've just split
region X, etc. -- but now 90% of this stuff is done via zk.  In 0.92
we'll finish the cleanup).
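
If it helps to picture what the zk heartbeat buys us, here is a minimal
sketch using the plain ZooKeeper client API (this is not HBase code; the
connect string and znode path are made up for the example and the parent
znode is assumed to already exist, while the 90s session timeout matches
the sessionTimeout=90000 in the logs below).  The process registers an
ephemeral znode; when its session expires the znode disappears, and that
disappearance is the liveness signal the master reacts to:

    import org.apache.zookeeper.*;

    public class EphemeralLiveness {
      public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 90000, event -> {
          if (event.getState() == Watcher.Event.KeeperState.Expired) {
            // This is the moment a regionserver logs
            // "received expired from ZooKeeper, aborting".
            System.err.println("zk session expired -- aborting");
            System.exit(1);
          }
        });
        // The ephemeral node lives only as long as the zk session.  In
        // HBase the master watches the regionserver znodes and treats a
        // deleted one as a dead server.
        zk.create("/demo/rs/myserver", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        Thread.sleep(Long.MAX_VALUE);
      }
    }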

Hope this helps,
St.Ack

On Sat, Jan 15, 2011 at 5:03 PM, Ted Yu <[email protected]> wrote:
> For #1, I assume I should look for 'received expired from ZooKeeper,
> aborting'
>
> On Sat, Jan 15, 2011 at 5:02 PM, Ted Yu <[email protected]> wrote:
>
>> For #1, what string should I look for in the region server log?
>> For #4, what's the rationale behind sending YADE after receiving a
>> heartbeat? I thought a heartbeat means the RS is alive.
>>
>> Thanks
>>
>>
>> On Sat, Jan 15, 2011 at 4:49 PM, Stack <[email protected]> wrote:
>>
>>> FYI Ted, the YouAreDeadException usually happens in the following context:
>>>
>>> 1. Regionserver has some kinda issue -- a long GC pause for instance --
>>> and it stops tickling zk.
>>> 2. Master gets the zk session expired event and starts up recovery of
>>> the hung regionserver.
>>> 3. Regionserver recovers but has not yet processed its session expired
>>> event.  It heartbeats the Master as though nothing were wrong.
>>> 4. Master is mid-recovery or already past it, and on receipt of the
>>> heartbeat it in essence tells the regionserver to 'go away' by
>>> sending it the YouAreDeadException (see the sketch below).
>>> 5. By now the regionserver will have gotten its session expired
>>> notification and will have started an abort, so the YADE is not news
>>> when it receives the exception.
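
(Not the actual ServerManager code, just a minimal sketch of the shape of
the step 4 check; the class and method names here are made up, while the
exception message mirrors the one in Ted's log further down.)

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    class DeadServerCheck {
      // Servers whose zk session has expired and whose recovery has started.
      private final Set<String> deadServers = ConcurrentHashMap.newKeySet();

      // Step 2: master sees the session expiry and marks the server dead.
      void markDead(String serverName) {
        deadServers.add(serverName);
      }

      // Steps 3-4: a report (heartbeat) arrives from a server already on the
      // dead list, so the master rejects it instead of recording its stats.
      void regionServerReport(String serverName) throws Exception {
        if (deadServers.contains(serverName)) {
          throw new Exception("YouAreDeadException: Server REPORT rejected; "
              + "currently processing " + serverName + " as dead server");
        }
        // ... otherwise accept the report and its load stats ...
      }
    }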
>>>
>>> St.Ack
>>>
>>>
>>> On Fri, Jan 14, 2011 at 7:49 PM, Ted Yu <[email protected]> wrote:
>>> > Thanks for your analysis, Ryan.
>>> > The dev cluster has half as many nodes as our staging cluster. Each
>>> > node has half the number of cores as the node in staging.
>>> >
>>> > I agree with your conclusion.
>>> >
>>> > I will report back after I collect more data - the flow uses hbase
>>> > heavily toward the end.
>>> >
>>> > On Fri, Jan 14, 2011 at 6:20 PM, Ryan Rawson <[email protected]> wrote:
>>> >
>>> >> I'm not seeing much in the way of errors -- just timeouts, all to one
>>> >> machine ending with .80, so that is probably your failed node.
>>> >>
>>> >> Other than that, the log doesn't seem to say too much.  Searching for
>>> >> strings like FATAL and Exception is the way to go here.
>>> >>
>>> >> Also things like this:
>>> >> 2011-01-14 23:38:52,936 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,1294897316027.2c40637c6c648a67162cc38d8c6d8ee9.
>>> >>
>>> >>
>>> >> Guessing, I'd probably say your nodes hit some performance wall, with
>>> >> io-wait, or networking, or something, and regionserver processes
>>> >> stopped responding, but did not time out from zookeeper yet... so you
>>> >> would run into a situation where some nodes are unresponsive, and any
>>> >> data hosted there would be difficult to talk to.  Until the
>>> >> regionserver times out its zookeeper node, the master doesn't know
>>> >> about the fault of the regionserver.
>>> >>
>>> >> The master web UI is probably inaccessible because the META table is
>>> >> on a regionserver that went AWOL.  You should check your load and your
>>> >> ganglia graphs.  Also remember, despite having lots of disks, each
>>> >> node has gigabit ethernet, which means about 110-120 MB/sec.  It's
>>> >> quite possible you are running into network limitations.  Remember
>>> >> that regionservers must write to 2 additional datanodes, and there
>>> >> will be overlap, so you have to share some of that 110-120 MB/sec per
>>> >> node with other nodes, not to mention that you also need to factor in
>>> >> inbound bandwidth (from client -> hbase regionserver) and outbound
>>> >> bandwidth (from datanode replica 1 -> dn replica 2).
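
(Rough back-of-the-envelope math for that point -- the numbers are
illustrative assumptions: ~115 MB/sec of usable gigabit per node and HDFS
replication of 3.)

    public class WalBandwidth {
      public static void main(String[] args) {
        double nicMBps = 115.0;  // usable gigabit ethernet per node (assumed)
        // With 3x HDFS replication, each MB written crosses node NICs
        // roughly 2-3 times (client -> RS, then DN -> DN hops for the two
        // remote replicas), so the sustainable ingest per node is only a
        // fraction of the raw NIC figure.
        System.out.printf("Raw NIC:             ~%.0f MB/sec%n", nicMBps);
        System.out.printf("Rough write ceiling: ~%.0f-%.0f MB/sec per node%n",
            nicMBps / 3.0, nicMBps / 2.0);
      }
    }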
>>> >>
>>> >> -ryan
>>> >>
>>> >> On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu <[email protected]> wrote:
>>> >> > Now I cannot access the master web UI. This happened after I doubled
>>> >> > the amount of data processed in our flow.
>>> >> >
>>> >> > I am attaching master log.
>>> >> >
>>> >> > On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson <[email protected]> wrote:
>>> >> >>
>>> >> >> This is the cause:
>>> >> >>
>>> >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting
>>> >> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>> >> >>
>>> >> >> Why did the session expire?  Typically it's GC; what do your GC
>>> >> >> logs say?  Otherwise, network issues perhaps?  Swapping?  Other
>>> >> >> machine-related systems problems?
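
(If GC logging isn't already turned on, one common way -- an assumption
about your setup, so adjust the file and log path -- is to add the
standard JVM GC flags in hbase-env.sh and restart the regionservers:)

    # conf/hbase-env.sh -- illustrative; pick a log path that exists on your nodes
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
        -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"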
>>> >> >>
>>> >> >> -ryan
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu <[email protected]> wrote:
>>> >> >> > I ran 0.90 RC3 in dev cluster.
>>> >> >> >
>>> >> >> > I saw the following in region server log:
>>> >> >> >
>>> >> >> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 as dead server
>>> >> >> >     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
>>> >> >> >     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
>>> >> >> >     at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648)
>>> >> >> >     at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>> >> >> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> >> >> >     at java.lang.reflect.Method.invoke(Method.java:597)
>>> >> >> >     at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>>> >> >> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
>>> >> >> >
>>> >> >> >     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
>>> >> >> >     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>>> >> >> >     at $Proxy0.regionServerReport(Unknown Source)
>>> >> >> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702)
>>> >> >> >     ... 2 more
>>> >> >> > 2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181 sessionTimeout=90000 watcher=hconnection
>>> >> >> > 2011-01-13 03:55:08,914 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting
>>> >> >> > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>> >> >> >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
>>> >> >> >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
>>> >> >> >     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
>>> >> >> >     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
>>> >> >> >
>>> >> >> > ---------------
>>> >> >> >
>>> >> >> > And the following from master log:
>>> >> >> >
>>> >> >> > 2011-01-13 03:52:42,003 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378]
>>> >> >> > 2011-01-13 03:52:42,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 to dead servers, submitted shutdown handler to be executed, root=false, meta=false
>>> >> >> > 2011-01-13 03:52:42,005 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
>>> >> >> > 2011-01-13 03:52:42,092 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
>>> >> >> > 2011-01-13 03:52:42,093 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-0,5,main]: starting
>>> >> >> > 2011-01-13 03:52:42,094 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-1,5,main]: starting
>>> >> >> > 2011-01-13 03:52:42,096 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378/sjc1-hadoop1.sjc1.carrieriq.com%3A60020.1294860449407, length=0
>>> >> >> >
>>> >> >> > Please advise what could be the cause.
>>> >> >> >
>>> >> >> > Thanks
>>> >> >> >
>>> >> >
>>> >> >
>>> >>
>>> >
>>>
>>
>>
>
