For #1, what string should I look for in the region server log? For #4, what's the rationale behind sending YADE after receiving a heartbeat? I thought a heartbeat means the RS is alive.

Thanks
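Re #4, the key point in Stack's sequence quoted below is that ZooKeeper delivers session expiry to its client as an asynchronous watcher event: a regionserver just back from a long GC pause can keep reporting to the master until its own event thread gets around to the Expired notification, and that is the window in which the master answers with YouAreDeadException. Below is a minimal sketch of that delivery path against the plain ZooKeeper client API -- the class name is hypothetical, and the quorum address and 90000 ms session timeout are simply the values visible in the logs quoted further down.

import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch: how a ZooKeeper client learns that its session expired.
public class ExpiryWatcherSketch implements Watcher {
    private volatile boolean aborting = false;

    @Override
    public void process(WatchedEvent event) {
        // The Expired notification arrives on the client's event thread. After a
        // long GC pause that thread runs behind, so the process may already be
        // reporting to the master (step 3) before this callback fires and the
        // abort begins (step 5).
        if (event.getState() == Event.KeeperState.Expired) {
            aborting = true;  // in HBase this is where the regionserver aborts itself
        }
    }

    public boolean isAborting() {
        return aborting;
    }

    public static void main(String[] args) throws IOException {
        // Quorum address and session timeout copied from the logs quoted below.
        new ZooKeeper("sjc1-hadoop0.sjc1.carrieriq.com:2181", 90000, new ExpiryWatcherSketch());
    }
}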
On Sat, Jan 15, 2011 at 4:49 PM, Stack <st...@duboce.net> wrote:
> FYI Ted, the YouAreDeadException usually happens in the following context:
>
> 1. Regionserver has some kind of issue -- a long GC pause for instance -- and it stops tickling zk.
> 2. Master gets the zk session expired event. Starts up recovery of the hung regionserver.
> 3. Regionserver recovers but has not yet processed its session expired event. It heartbeats the Master as though nothing is wrong.
> 4. Master is mid-recovery or beyond server recovery and on receipt of the heartbeat in essence tells the regionserver to 'go away' by sending it the YouAreDeadException.
> 5. By now the regionserver will have gotten its session expired notification and will have started an abort, so the YADE is not news when it receives the exception.
>
> St.Ack
>
>
> On Fri, Jan 14, 2011 at 7:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > Thanks for your analysis, Ryan.
> > The dev cluster has half as many nodes as our staging cluster. Each node has half the number of cores as the nodes in staging.
> >
> > I agree with your conclusion.
> >
> > I will report back after I collect more data - the flow uses hbase heavily toward the end.
> >
> > On Fri, Jan 14, 2011 at 6:20 PM, Ryan Rawson <ryano...@gmail.com> wrote:
> >
> >> I'm not seeing much in the way of errors; what I do see are timeouts, all to one machine ending with .80, so that is probably your failed node.
> >>
> >> Other than that, the log doesn't seem to say too much. Searching for strings like FATAL and Exception is the way to go here.
> >>
> >> Also things like this:
> >> 2011-01-14 23:38:52,936 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,1294897316027.2c40637c6c648a67162cc38d8c6d8ee9.
> >>
> >> Guessing, I'd probably say your nodes hit some performance wall, with io-wait, or networking, or something, and Regionserver processes stopped responding, but did not time out from zookeeper yet... so you would run into a situation where some nodes are unresponsive, so any data hosted there would be difficult to talk to. Until the regionserver times out its zookeeper node, the master doesn't know about the fault of the regionserver.
> >>
> >> The master web UI is probably inaccessible because the META table is on a regionserver that went AWOL. You should check your load and your ganglia graphs. Also remember, despite having lots of disks, each node has gigabit ethernet, which means about 110-120 MB/sec. It's quite possible you are running into network limitations; remember that regionservers must write to 2 additional datanodes, and there will be overlap, thus you have to share some of that 110-120 MB/sec per node figure with other nodes, not to mention that you also need to factor in inbound bandwidth (from client -> hbase regionserver) and outbound bandwidth (from datanode replica 1 -> dn replica 2).
> >>
> >> -ryan
> >>
> >> On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >> > Now I cannot access the master web UI. This happened after I doubled the amount of data processed in our flow.
> >> >
> >> > I am attaching the master log.
> >> >
> >> > On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson <ryano...@gmail.com> wrote:
> >> >>
> >> >> This is the cause:
> >> >>
> >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting
> >> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> >>
> >> >> Why did the session expire? Typically it's GC; what do your GC logs say? Otherwise, network issues perhaps? Swapping? Other machine-related system problems?
> >> >>
> >> >> -ryan
> >> >>
> >> >> On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >> >> > I ran 0.90 RC3 in the dev cluster.
> >> >> >
> >> >> > I saw the following in the region server log:
> >> >> >
> >> >> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 as dead server
> >> >> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
> >> >> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
> >> >> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648)
> >> >> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> >> >> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >> >> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >> >> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
> >> >> >
> >> >> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
> >> >> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> >> >> >         at $Proxy0.regionServerReport(Unknown Source)
> >> >> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702)
> >> >> >         ... 2 more
> >> >> > 2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181 sessionTimeout=90000 watcher=hconnection
> >> >> > 2011-01-13 03:55:08,914 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting
> >> >> > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >> >> >         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
> >> >> >         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
> >> >> >         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
> >> >> >         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> >> >> >
> >> >> > ---------------
> >> >> >
> >> >> > And the following from the master log:
> >> >> >
> >> >> > 2011-01-13 03:52:42,003 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378]
> >> >> > 2011-01-13 03:52:42,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 to dead servers, submitted shutdown handler to be executed, root=false, meta=false
> >> >> > 2011-01-13 03:52:42,005 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
> >> >> > 2011-01-13 03:52:42,092 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
> >> >> > 2011-01-13 03:52:42,093 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-0,5,main]: starting
> >> >> > 2011-01-13 03:52:42,094 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-1,5,main]: starting
> >> >> > 2011-01-13 03:52:42,096 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378/sjc1-hadoop1.sjc1.carrieriq.com%3A60020.1294860449407, length=0
> >> >> >
> >> >> > Please advise what could be the cause.
> >> >> >
> >> >> > Thanks
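One way to put numbers on Ryan's gigabit/replication point above -- the figures below are illustrative assumptions, not measurements from this cluster: with HDFS replication 3, every MB a client writes crosses the network roughly three times (client -> regionserver/datanode 1, datanode 1 -> datanode 2, datanode 2 -> datanode 3), so the ~110-120 MB/sec a gigabit NIC gives each node has to cover a multiple of the raw ingest rate.

// Rough back-of-envelope only; hypothetical class, assumed NIC throughput.
public class BandwidthSketch {
    public static void main(String[] args) {
        double nicMBps = 115.0;  // assumed usable gigabit throughput per node, per direction
        int networkHops = 3;     // client -> DN1, DN1 -> DN2, DN2 -> DN3

        // Spread evenly over the cluster, each node's NIC carries roughly
        // `networkHops` bytes per byte of raw ingest, so sustainable ingest
        // per node is on the order of:
        System.out.printf("~%.0f MB/sec of ingest per node before the NIC saturates%n",
                nicMBps / networkHops);
    }
}

That works out to something like a third of the raw NIC figure per node, before counting compaction rewrites or read traffic, which take their own share of the same link.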