For #1, I assume I should look for 'received expired from ZooKeeper, aborting'

On Sat, Jan 15, 2011 at 5:02 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> For #1, what string should I look for in the region server log?
> For #4, what's the rationale behind sending YADE after receiving a
> heartbeat?  I thought a heartbeat means the RS is alive.
>
> Thanks
>
>
> On Sat, Jan 15, 2011 at 4:49 PM, Stack <st...@duboce.net> wrote:
>
>> FYI Ted, the YouAreDeadException usually happens in the following context:
>>
>> 1. Regionserver has some kinda issue -- long GC pause for instance --
>> and it stops tickling zk.
>> 2. Master gets a zk session expired event.  Starts up recovery of the
>> hung regionserver.
>> 3. Regionserver recovers but has not yet processed its session expired
>> event.  It heartbeats the Master as though nothing is wrong.
>> 4. Master is mid-recovery (or already done recovering the server) and,
>> on receipt of the heartbeat, in essence tells the regionserver to 'go
>> away' by sending it the YouAreDeadException (sketched below).
>> 5. By now the regionserver will have gotten its own session expired
>> notification and will have started an abort, so the YADE is not news by
>> the time it receives the exception.
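>>
>> Roughly, the master-side check behind step 4 looks like this -- a
>> simplified sketch pieced together from the ServerManager stack trace
>> further down this thread, not the literal code:
>>
>>   // Hypothetical condensation of ServerManager.checkIsDead();
>>   // deadServers holds servers the master has declared dead and is
>>   // (or has finished) recovering.
>>   void checkIsDead(String serverName) throws YouAreDeadException {
>>     if (deadServers.contains(serverName)) {
>>       // Any heartbeat/report from such a server is rejected outright.
>>       throw new YouAreDeadException("Server REPORT rejected; currently "
>>           + "processing " + serverName + " as dead server");
>>     }
>>   }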
>>
>> St.Ack
>>
>>
>> On Fri, Jan 14, 2011 at 7:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > Thanks for your analysis, Ryan.
>> > The dev cluster has half as many nodes as our staging cluster. Each node
>> > has half the number of cores as the node in staging.
>> >
>> > I agree with your conclusion.
>> >
>> > I will report back after I collect more data - the flow uses hbase
>> > heavily toward the end.
>> >
>> > On Fri, Jan 14, 2011 at 6:20 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>> >
>> >> I'm not seeing much in the way of errors; there are timeouts, all going
>> >> to one machine ending in .80, so that is probably your failed node.
>> >>
>> >> Other than that, the log doesn't seem to say too much.  Searching for
>> >> strings like FATAL and Exception is the way to go here.
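>> >>
>> >> For instance, something along these lines (the log path is a guess --
>> >> adjust to your install):
>> >>
>> >>   grep -E 'FATAL|Exception' logs/hbase-*-regionserver-*.log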
>> >>
>> >> Also things like this:
>> >> 2011-01-14 23:38:52,936 INFO
>> >> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
>> >> PENDING_OPEN for too long, reassigning
>> >> region=NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,1294897316027.2c40637c6c648a67162cc38d8c6d8ee9.
>> >>
>> >>
>> >> Guessing, I'd probably say your nodes hit some performance wall --
>> >> io-wait, networking, or something -- and regionserver processes stopped
>> >> responding but had not yet timed out from zookeeper... so you would run
>> >> into a situation where some nodes are unresponsive, and any data hosted
>> >> there would be difficult to talk to.  Until the regionserver's
>> >> zookeeper session times out, the master doesn't know about the fault.
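>> >>
>> >> The window there is the ZooKeeper session timeout -- your RS log further
>> >> down this thread shows sessionTimeout=90000, i.e. 90 seconds (I believe
>> >> the setting is zookeeper.session.timeout in hbase-site.xml).  Only when
>> >> the session expires does the RS's ephemeral znode go away and the
>> >> master's watcher fire.  Very roughly, with made-up names rather than
>> >> the actual RegionServerTracker code:
>> >>
>> >>   // Master side: react to a regionserver's ephemeral znode vanishing.
>> >>   public void process(WatchedEvent event) {
>> >>     if (event.getType() == Event.EventType.NodeDeleted
>> >>         && event.getPath().startsWith(rsZNodeDir)) {  // e.g. /hbase/rs
>> >>       // Kick off recovery: split its logs, reassign its regions.
>> >>       expireServer(serverNameFromPath(event.getPath()));
>> >>     }
>> >>   }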
>> >>
>> >> The master web UI is probably inaccessible because the META table is
>> >> on a regionserver that went AWOL.  You should check your load and your
>> >> ganglia graphs.  Also remember that, despite having lots of disks, each
>> >> node has a single gigabit ethernet link, which means about 110-120
>> >> MB/sec.  It's quite possible you are running into network limitations.
>> >> Remember that regionservers must write to 2 additional datanodes, and
>> >> there will be overlap, so you have to share some of that 110-120
>> >> MB/sec-per-node figure with other nodes -- not to mention that you also
>> >> need to factor in inbound bandwidth (from client -> hbase regionserver)
>> >> and outbound bandwidth (from datanode replica 1 -> dn replica 2).
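>> >>
>> >> Back-of-envelope: at 3x replication, each MB/sec a client writes to a
>> >> regionserver costs that node roughly 1 MB/sec inbound from the client,
>> >> 1 MB/sec outbound to the second datanode, and (averaged over the
>> >> cluster) about 2 MB/sec more inbound as the node takes its share of
>> >> other nodes' replicas.  So on the order of 30-40 MB/sec of client
>> >> writes per node can already push a gigabit link toward saturation.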
>> >>
>> >> -ryan
>> >>
>> >> On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >> > Now I cannot access the master web UI.  This happened after I doubled
>> >> > the amount of data processed in our flow.
>> >> >
>> >> > I am attaching master log.
>> >> >
>> >> > On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>> >> >>
>> >> >> This is the cause:
>> >> >>
>> >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>> >> >> serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
>> >> >> load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
>> >> >> regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004
>> >> >> received expired from ZooKeeper, aborting
>> >> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> >> >>
>> >> >> Why did the session expire?  Typically it's GC; what do your GC logs
>> >> >> say?  Otherwise, network issues perhaps?  Swapping?  Other
>> >> >> machine-related system problems?
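>> >> >>
>> >> >> If GC logging isn't already on, you can enable it in conf/hbase-env.sh
>> >> >> with something like this (flags from memory -- double-check them
>> >> >> against your JVM version):
>> >> >>
>> >> >>   export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
>> >> >>       -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"
>> >> >>
>> >> >> A multi-second pause (e.g. a CMS concurrent mode failure) lining up
>> >> >> with the session expiry timestamp would confirm GC as the culprit.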
>> >> >>
>> >> >> -ryan
>> >> >>
>> >> >>
>> >> >> On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >> >> > I ran 0.90 RC3 in the dev cluster.
>> >> >> >
>> >> >> > I saw the following in the region server log:
>> >> >> >
>> >> >> > Caused by: org.apache.hadoop.ipc.RemoteException:
>> >> >> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>> >> >> > currently processing sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
>> >> >> > as dead server
>> >> >> >    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
>> >> >> >    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
>> >> >> >    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648)
>> >> >> >    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>> >> >> >    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >> >> >    at java.lang.reflect.Method.invoke(Method.java:597)
>> >> >> >    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> >> >> >    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
>> >> >> >
>> >> >> >    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
>> >> >> >    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>> >> >> >    at $Proxy0.regionServerReport(Unknown Source)
>> >> >> >    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702)
>> >> >> >    ... 2 more
>> >> >> > 2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper: Initiating
>> >> >> > client connection, connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181
>> >> >> > sessionTimeout=90000 watcher=hconnection
>> >> >> > 2011-01-13 03:55:08,914 FATAL
>> >> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>> >> >> > serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
>> >> >> > load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
>> >> >> > regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004
>> >> >> > received expired from ZooKeeper, aborting
>> >> >> > org.apache.zookeeper.KeeperException$SessionExpiredException:
>> >> >> > KeeperErrorCode = Session expired
>> >> >> >    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
>> >> >> >    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
>> >> >> >    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
>> >> >> >    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
>> >> >> >
>> >> >> > ---------------
>> >> >> >
>> >> >> > And the following from the master log:
>> >> >> >
>> >> >> > 2011-01-13 03:52:42,003 INFO
>> >> >> > org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
>> >> >> > ephemeral node deleted, processing expiration
>> >> >> > [sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378]
>> >> >> > 2011-01-13 03:52:42,005 DEBUG org.apache.hadoop.hbase.master.ServerManager:
>> >> >> > Added=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 to dead servers,
>> >> >> > submitted shutdown handler to be executed, root=false, meta=false
>> >> >> > 2011-01-13 03:52:42,005 INFO
>> >> >> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
>> >> >> > logs for sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
>> >> >> > 2011-01-13 03:52:42,092 INFO
>> >> >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s)
>> >> >> > in hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
>> >> >> > 2011-01-13 03:52:42,093 DEBUG
>> >> >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread
>> >> >> > Thread[WriterThread-0,5,main]: starting
>> >> >> > 2011-01-13 03:52:42,094 DEBUG
>> >> >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread
>> >> >> > Thread[WriterThread-1,5,main]: starting
>> >> >> > 2011-01-13 03:52:42,096 DEBUG
>> >> >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of
>> >> >> > 1: hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378/sjc1-hadoop1.sjc1.carrieriq.com%3A60020.1294860449407,
>> >> >> > length=0
>> >> >> >
>> >> >> > Please advise what could be the cause.
>> >> >> >
>> >> >> > Thanks
>> >> >> >
>> >> >
>> >> >
>> >>
>> >
>>
>
>
