I'm not seeing much in the way of errors, just timeouts, all to one machine ending with .80, so that is probably your failed node.

Other than that, the log doesn't seem to say too much. Searching for strings like FATAL and Exception is the way to go here. Also things like this:

2011-01-14 23:38:52,936 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,1294897316027.2c40637c6c648a67162cc38d8c6d8ee9.
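For pulling those markers out of a big log in one pass, something along these lines works. It is only a minimal sketch, and the default filename is a placeholder rather than a path from this cluster:

# Minimal sketch of the log search described above: pull out FATAL lines,
# exceptions, stuck PENDING_OPEN reassignments, and expired ZooKeeper sessions.
# The default log filename below is a placeholder, not a path from this cluster.
import re
import sys

MARKERS = re.compile(
    r"\bFATAL\b|Exception|PENDING_OPEN for too long|SessionExpiredException"
)

def scan(path):
    with open(path, errors="replace") as log:
        for lineno, line in enumerate(log, 1):
            if MARKERS.search(line):
                print(f"{lineno}: {line.rstrip()}")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "hbase-regionserver.log")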
Guessing, I'd say your nodes hit some performance wall (io-wait, networking, or something similar), and regionserver processes stopped responding but had not yet timed out from ZooKeeper. So you end up in a situation where some nodes are unresponsive, and any data hosted there is difficult to reach. Until the regionserver's ZooKeeper node times out, the master doesn't know the regionserver has failed.

The master web UI is probably inaccessible because the META table is on a regionserver that went AWOL. You should check your load and your ganglia graphs.

Also remember that despite having lots of disks, each node has only gigabit ethernet, which works out to about 110-120 MB/sec. It's quite possible you are running into network limitations: regionservers must write to 2 additional datanodes, and there will be overlap, so you have to share some of that 110-120 MB/sec per-node figure with other nodes. You also need to factor in inbound bandwidth (client -> hbase regionserver) and outbound bandwidth (datanode replica 1 -> dn replica 2).
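Back-of-the-envelope, that budget looks roughly like this; the figures are generic assumptions, not measurements from your cluster:

# Crude sketch of the bandwidth argument above. The numbers are assumptions
# (realistic 1 GbE throughput, default HDFS replication), and the accounting
# deliberately ignores that the NIC is full duplex and that the first replica
# is usually written to the local datanode.
nic_mb_s = 115          # what a gigabit NIC realistically delivers (110-120 MB/s)
replication = 3         # each HBase write ultimately lands on 3 datanodes

# For every MB a client writes, a node's NIC carries roughly the client's
# inbound MB plus the replica traffic it sends/receives for the write pipeline.
mb_on_wire_per_mb_written = 1 + (replication - 1)

effective_write_mb_s = nic_mb_s / mb_on_wire_per_mb_written
print(f"~{effective_write_mb_s:.0f} MB/s of user writes per node before the NIC saturates")
# -> roughly 38 MB/s, far less than a shelf of local disks could absorb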
-ryan

On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Now I cannot access the master web UI. This happened after I doubled the amount
> of data processed in our flow.
>
> I am attaching the master log.
>
> On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson <ryano...@gmail.com> wrote:
>>
>> This is the cause:
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>> serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
>> load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
>> regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004
>> received expired from ZooKeeper, aborting
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>
>> Why did the session expire? Typically it's GC; what do your GC logs
>> say? Otherwise, network issues perhaps? Swapping? Other machine-related
>> system problems?
>>
>> -ryan
>>
>>
>> On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > I ran 0.90 RC3 in the dev cluster.
>> >
>> > I saw the following in the region server log:
>> >
>> > Caused by: org.apache.hadoop.ipc.RemoteException:
>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>> > currently processing sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 as
>> > dead server
>> >   at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
>> >   at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
>> >   at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648)
>> >   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >   at java.lang.reflect.Method.invoke(Method.java:597)
>> >   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
>> >
>> >   at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
>> >   at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>> >   at $Proxy0.regionServerReport(Unknown Source)
>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702)
>> >   ... 2 more
>> > 2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection,
>> > connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181 sessionTimeout=90000 watcher=hconnection
>> > 2011-01-13 03:55:08,914 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>> > serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
>> > load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
>> > regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004
>> > received expired from ZooKeeper, aborting
>> > org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > KeeperErrorCode = Session expired
>> >   at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
>> >   at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
>> >   at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
>> >   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
>> >
>> > ---------------
>> >
>> > And the following from the master log:
>> >
>> > 2011-01-13 03:52:42,003 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
>> > ephemeral node deleted, processing expiration
>> > [sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378]
>> > 2011-01-13 03:52:42,005 DEBUG org.apache.hadoop.hbase.master.ServerManager:
>> > Added=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 to dead servers,
>> > submitted shutdown handler to be executed, root=false, meta=false
>> > 2011-01-13 03:52:42,005 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs
>> > for sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
>> > 2011-01-13 03:52:42,092 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s)
>> > in hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
>> > 2011-01-13 03:52:42,093 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread
>> > Thread[WriterThread-0,5,main]: starting
>> > 2011-01-13 03:52:42,094 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread
>> > Thread[WriterThread-1,5,main]: starting
>> > 2011-01-13 03:52:42,096 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1:
>> > hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378/sjc1-hadoop1.sjc1.carrieriq.com%3A60020.1294860449407,
>> > length=0
>> >
>> > Please advise what could be the cause.
>> >
>> > Thanks
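One way to follow up on the GC question in the reply quoted above is to scan the regionserver's GC log for stop-the-world pauses long enough to threaten the 90-second ZooKeeper session timeout (sessionTimeout=90000 in the log). This is only a sketch: it assumes GC logging was enabled with -verbose:gc -XX:+PrintGCDetails, whose pause records end in ", N secs]", and gc.log is a placeholder path:

# Sketch: flag GC pauses that approach the ZooKeeper session timeout.
# Assumes GC log lines produced by -verbose:gc -XX:+PrintGCDetails,
# e.g. "[Full GC ... 123456K->2345K(456789K), 23.4567890 secs]".
import re
import sys

SESSION_TIMEOUT_SECS = 90.0
PAUSE = re.compile(r"([\d.]+) secs\]")

def long_pauses(path, threshold=SESSION_TIMEOUT_SECS / 2):
    with open(path, errors="replace") as gc_log:
        for line in gc_log:
            for match in PAUSE.finditer(line):
                secs = float(match.group(1))
                if secs >= threshold:
                    print(f"{secs:7.1f}s  {line.strip()}")

if __name__ == "__main__":
    long_pauses(sys.argv[1] if len(sys.argv) > 1 else "gc.log")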