For #1, I assume I should look for 'received expired from ZooKeeper, aborting'
On Sat, Jan 15, 2011 at 5:02 PM, Ted Yu <yuzhih...@gmail.com> wrote:

For #1, what string should I look for in the region server log?
For #4, what's the rationale behind sending YADE after receiving a heartbeat? I thought a heartbeat means the RS is alive.

Thanks

On Sat, Jan 15, 2011 at 4:49 PM, Stack <st...@duboce.net> wrote:

FYI Ted, the YouAreDeadException usually happens in the following context:

1. The regionserver has some kind of issue -- a long GC pause, for instance -- and it stops tickling zk.
2. The master gets the zk session expired event and starts recovery of the hung regionserver.
3. The regionserver recovers but has not yet processed its session expired event, so it heartbeats the master as though nothing is wrong.
4. The master is mid-recovery (or already past it) and, on receipt of the heartbeat, in essence tells the regionserver to 'go away' by sending it the YouAreDeadException.
5. By now the regionserver will have gotten its own session expired notification and started an abort, so the YADE is not news when the exception arrives.

St.Ack

On Fri, Jan 14, 2011 at 7:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Thanks for your analysis, Ryan.
The dev cluster has half as many nodes as our staging cluster, and each node has half the number of cores of a staging node.

I agree with your conclusion.

I will report back after I collect more data -- the flow uses hbase heavily toward the end.

On Fri, Jan 14, 2011 at 6:20 PM, Ryan Rawson <ryano...@gmail.com> wrote:

I'm not seeing much in the way of errors; the timeouts all go to one machine ending in .80, so that is probably your failed node.

Other than that, the log doesn't seem to say too much. Searching for strings like FATAL and Exception is the way to go here.

Also things like this:

2011-01-14 23:38:52,936 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,1294897316027.2c40637c6c648a67162cc38d8c6d8ee9.

Guessing, I'd say your nodes hit some performance wall -- io-wait, networking, or something -- and regionserver processes stopped responding but had not yet timed out from zookeeper. You then end up in a situation where some nodes are unresponsive, so any data hosted there is difficult to talk to. Until the regionserver's zookeeper node times out, the master doesn't know about the regionserver's fault.

The master web UI is probably inaccessible because the META table is on a regionserver that went AWOL. You should check your load and your ganglia graphs. Also remember that, despite having lots of disks, each node is on gigabit ethernet, which means about 110-120 MB/sec. It's quite possible you are running into network limitations: regionservers must write to 2 additional datanodes, and there will be overlap, so you have to share some of that 110-120 MB/sec per-node figure with other nodes, not to mention that you also need to factor in inbound bandwidth (client -> hbase regionserver) and outbound bandwidth (datanode replica 1 -> dn replica 2).

-ryan
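To put rough numbers on Ryan's bandwidth point, here is a back-of-envelope calculation. The 3x replication factor is the HDFS default and an assumption about this cluster, and the figures ignore locality, compression, and read traffic:

    gigabit NIC per node:                   ~110-120 MB/s in each direction
    HDFS replication (assumed default):      3 copies of every block written
    per byte a regionserver persists:        ~1 byte in from the client
                                             + ~1 byte out to the 2nd datanode
                                             + ~1 byte relayed on to the 3rd datanode

    => a sustained ingest of W MB/s per regionserver costs roughly 3 x W MB/s
       of NIC capacity spread across the cluster, so the realistic ceiling is
       closer to 110 / 3 ~ 35-40 MB/s of ingest per node than to raw wire speed.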
On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Now I cannot access the master web UI. This happened after I doubled the amount of data processed in our flow.

I am attaching the master log.

On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson <ryano...@gmail.com> wrote:

This is the cause:

org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:

Why did the session expire? Typically it's GC -- what do your GC logs say? Otherwise, network issues perhaps? Swapping? Other machine-related system problems?

-ryan
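For reference, the pattern behind the "received expired from ZooKeeper, aborting" line is roughly the one below. This is a minimal sketch against the plain ZooKeeper client API, not the actual HBase ZooKeeperWatcher code; the ExpirySketch class name and the Runnable standing in for the region server's abort hook are made up for illustration.

import java.io.IOException;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

public class ExpirySketch implements Watcher {

  // Stands in for the region server's abort hook.
  private final Runnable abort;
  // The session whose expiry we are watching for.
  private final ZooKeeper zk;

  public ExpirySketch(String quorum, int sessionTimeoutMs, Runnable abort)
      throws IOException {
    this.abort = abort;
    // e.g. quorum = "sjc1-hadoop0.sjc1.carrieriq.com:2181", sessionTimeoutMs = 90000
    this.zk = new ZooKeeper(quorum, sessionTimeoutMs, this);
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() != EventType.None) {
      return; // data/children watches are not the concern of this sketch
    }
    switch (event.getState()) {
      case SyncConnected:
        // connected, or reconnected within the session timeout: nothing to do
        break;
      case Disconnected:
        // transient: the client library keeps retrying and the session may survive
        break;
      case Expired:
        // the quorum has already discarded the session and deleted the ephemeral
        // znodes, so the master is (or soon will be) recovering this server;
        // the only safe reaction is to stop serving and abort
        abort.run();
        break;
      default:
        break;
    }
  }
}

A 90-second timeout (the sessionTimeout=90000 in the log below) sounds generous, but a stop-the-world GC on a multi-GB heap can exceed it, which is why the GC logs are the first thing to check.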
On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu <yuzhih...@gmail.com> wrote:

I ran 0.90 RC3 in the dev cluster.

I saw the following in the region server log:

Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)

        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy0.regionServerReport(Unknown Source)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702)
        ... 2 more
2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181 sessionTimeout=90000 watcher=hconnection
2011-01-13 03:55:08,914 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)

---------------

And the following from the master log:

2011-01-13 03:52:42,003 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378]
2011-01-13 03:52:42,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 to dead servers, submitted shutdown handler to be executed, root=false, meta=false
2011-01-13 03:52:42,005 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
2011-01-13 03:52:42,092 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
2011-01-13 03:52:42,093 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-0,5,main]: starting
2011-01-13 03:52:42,094 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-1,5,main]: starting
2011-01-13 03:52:42,096 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378/sjc1-hadoop1.sjc1.carrieriq.com%3A60020.1294860449407, length=0

Please advise what could be the cause.

Thanks
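For what it's worth, the master-side half of the exchange (Stack's step 4, and the "Server REPORT rejected" message in the trace above) boils down to something like the sketch below. This is a simplified illustration, not the actual 0.90 ServerManager code; the DeadServerSketch class and its nested YouAreDeadException stand in for the real classes.

import java.io.IOException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class DeadServerSketch {

  // Stands in for org.apache.hadoop.hbase.YouAreDeadException.
  public static class YouAreDeadException extends IOException {
    public YouAreDeadException(String msg) { super(msg); }
  }

  // Server names are host,port,startcode, e.g.
  // "sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378".
  private final Set<String> deadServers =
      Collections.synchronizedSet(new HashSet<String>());

  // Called when the master sees the server's ephemeral znode disappear.
  public void expireServer(String serverName) {
    deadServers.add(serverName);
    // ... then submit a shutdown handler: split the server's HLogs and
    // reassign its regions (the "Splitting logs for ..." lines above).
  }

  // Called for every region server report (heartbeat).
  public void regionServerReport(String serverName) throws YouAreDeadException {
    checkIsDead(serverName);
    // ... otherwise record the reported load and carry on as normal.
  }

  private void checkIsDead(String serverName) throws YouAreDeadException {
    if (deadServers.contains(serverName)) {
      throw new YouAreDeadException(
          "Server REPORT rejected; currently processing " + serverName
              + " as dead server");
    }
  }
}

In the real system the rejection travels back over RPC, which is why it shows up on the region server wrapped in a RemoteException, as in the trace above.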