I am having problems with my HRegionServers aborting following rolling their logs and thus not being able to report to the HMaster for a while. Eventually all 5 HRegionservers have exited and my MR which is writing to HBase fails.

Reviewing the logs (with debugging on), I believe what is happening is the Memcache goes negative and compaction and rolling logs begins. After this is complete, the logs shows a FATAL error about not being able to report to the Master and then it exits. It takes about 45 mins to complete the slow death across all the HRservers

I have pasted the logs of one of the HRegionServers (omitting repeated log rolling reports, null pointer exception for closed regions and region closing messages.) Can someone point out what might be the problem? Thanks in advance,

Clay



2008-07-23 16:20:32,991 WARN org.apache.hadoop.hbase.HRegion: Memcache size went negative -59248835; resetting 2008-07-23 16:20:33,003 DEBUG org.apache.hadoop.hbase.HRegion: Finished memcache flush for region anansi,http://org.himssconference.www/,1216843897872 in 16814ms, sequence id=106628188 2008-07-23 16:20:33,003 DEBUG org.apache.hadoop.hbase.HRegionServer: Compaction requested for region: anansi,http://org.himssconference.www/,1216843897872 2008-07-23 16:20:33,004 INFO org.apache.hadoop.hbase.HRegion: checking compaction on region anansi,http://org.himssconference.www/,1216843897872 2008-07-23 16:20:33,032 INFO org.apache.hadoop.hbase.HRegion: checking compaction completed on region anansi,http://org.himssconference.www/,1216843897872 ; status: false; 2008-07-23 16:26:09,289 DEBUG org.apache.hadoop.hbase.HLog: Closing current log writer /hbase/log_10.1.2.8_1216843622025_60020/hlog.dat.060 2008-07-23 16:26:09,298 INFO org.apache.hadoop.hbase.HLog: New log writer created at /hbase/log_10.1.2.8_1216843622025_60020/hlog.dat.061 2008-07-23 16:26:14,494 INFO org.apache.hadoop.hbase.HRegionServer: Rolling hlog. Number of entries: 30001 2008-07-23 16:26:14,538 DEBUG org.apache.hadoop.hbase.HLog: Closing current log writer /hbase/log_10.1.2.8_1216843622025_60020/hlog.dat.061 2008-07-23 16:26:14,548 INFO org.apache.hadoop.hbase.HLog: New log writer created at /hbase/log_10.1.2.8_1216843622025_60020/hlog.dat.062
< omitted >
2008-07-23 16:32:58,652 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 410869ms, ten times longer than scheduled: 10000 2008-07-23 16:32:58,652 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 402421ms, ten times longer than scheduled: 3000 2008-07-23 16:32:59,060 FATAL org.apache.hadoop.hbase.HRegionServer: unable to report to master for 402861 milliseconds - aborting server 2008-07-23 16:32:59,452 INFO org.apache.hadoop.hbase.Leases: regionserver/0.0.0.0:60020 closing leases 2008-07-23 16:33:00,294 INFO org.apache.hadoop.hbase.Leases $LeaseMonitor: regionserver/0.0.0.0:60020.leaseChecker exiting 2008-07-23 16:33:00,534 INFO org.apache.hadoop.hbase.Leases: regionserver/0.0.0.0:60020 closed leases 2008-07-23 16:33:00,534 INFO org.apache.hadoop.ipc.Server: Stopping server on 60020 2008-07-23 16:33:03,421 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 60020 2008-07-23 16:33:03,422 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder 2008-07-23 16:33:03,437 INFO org.apache.hadoop.hbase.HRegionServer: Stopping infoServer 2008-07-23 16:33:03,537 INFO org.mortbay.util.ThreadedServer: Stopping Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=60030] 2008-07-23 16:33:04,114 INFO org.mortbay.http.SocketListener: Stopped SocketListener on 0.0.0.0:60030 2008-07-23 16:34:59,945 INFO org.mortbay.util.Container: Stopped HttpContext[/static,/static] 2008-07-23 16:34:59,945 INFO org.apache.hadoop.hbase.HRegionServer: worker thread exiting 2008-07-23 16:35:15,596 INFO org.apache.hadoop.hbase.HRegionServer: regionserver/0.0.0.0:60020.compactor exiting 2008-07-23 16:35:15,715 INFO org.mortbay.util.Container: Stopped HttpContext[/logs,/logs] 2008-07-23 16:35:15,729 INFO org.apache.hadoop.hbase.HRegionServer: regionserver/0.0.0.0:60020.cacheFlusher exiting 2008-07-23 16:35:15,948 INFO org.mortbay.util.Container: Stopped [EMAIL PROTECTED] 2008-07-23 16:35:17,528 INFO org.mortbay.util.Container: Stopped WebApplicationContext[/,/] 2008-07-23 16:35:17,544 INFO org.mortbay.util.Container: Stopped [EMAIL PROTECTED] 2008-07-23 16:35:17,544 DEBUG org.apache.hadoop.hbase.HLog: closing log writer in hdfs://namenode:4010/hbase/ log_10.1.2.8_1216843622025_60020 2008-07-23 16:35:17,773 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 60020, call batchUpdate(anansi,http://com.davidhellmann.www/,1216150597773 , 9223372036854775807, [EMAIL PROTECTED]) from 10.1.2.7:52761: error: java.io.IOException: Server not running
java.io.IOException: Server not running
at org.apache.hadoop.hbase.HRegionServer.checkOpen(HRegionServer.java: 1758) at org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java: 1547)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at sun .reflect .DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java: 25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
2008-07-23 16:35:17,802 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 60020, call batchUpdate(anansi,http://com.davidhellmann.www/,1216150597773 , 9223372036854775807, [EMAIL PROTECTED]) from 10.1.2.8:49245: error: java.io.IOException: Server not running
java.io.IOException: Server not running
at org.apache.hadoop.hbase.HRegionServer.checkOpen(HRegionServer.java: 1758) at org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java: 1547)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at sun .reflect .DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java: 25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
2

2008-07-23 16:35:18,735 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 60020: exiting 2008-07-23 16:35:18,735 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 60020 caught: java.lang.NullPointerException
        at org.apache.hadoop.ipc.Server.closeConnection(Server.java:976)
        at org.apache.hadoop.ipc.Server.access$900(Server.java:66)
at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java: 646)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:659)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:922)

2008-07-23 16:35:18,736 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 60020: exiting 2008-07-23 16:35:18,774 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call batchUpdate(anansi,http://com.davidhellmann.www/,1216150597773 , 9223372036854775807, [EMAIL PROTECTED]) from 10.1.2.8:49245: output error 2008-07-23 16:35:18,774 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 60020 caught: java.lang.NullPointerException
        at org.apache.hadoop.ipc.Server.closeConnection(Server.java:976)
        at org.apache.hadoop.ipc.Server.access$900(Server.java:66)
at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java: 646)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:659)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:922)

2008-07-23 16:35:18,774 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 60020: exiting 2008-07-23 16:35:18,791 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call batchUpdate(anansi,http://com.davidhellmann.www/,1216150597773 , 9223372036854775807, [EMAIL PROTECTED]) from 10.1.2.4:43896: output error 2008-07-23 16:35:18,808 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 60020 caught: java.lang.NullPointerException
        at org.apache.hadoop.ipc.Server.closeConnection(Server.java:976)
        at org.apache.hadoop.ipc.Server.access$900(Server.java:66)
at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java: 646)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:659)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:922)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun .reflect .NativeConstructorAccessorImpl .newInstance(NativeConstructorAccessorImpl.java:39) at sun .reflect .DelegatingConstructorAccessorImpl .newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org .apache .hadoop .hbase .RemoteExceptionHandler .decodeRemoteException(RemoteExceptionHandler.java:82) at org .apache .hadoop .hbase .RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:48)
        at org.apache.hadoop.hbase.HRegionServer.run(HRegionServer.java:873)
        at java.lang.Thread.run(Thread.java:619)


2008-07-23 16:35:19,221 DEBUG org.apache.hadoop.hbase.HRegionServer: closing region anansi,http://org.himssconference.www/,1216843897872 2008-07-23 16:35:19,221 DEBUG org.apache.hadoop.hbase.HRegion: compactions and cache flushes disabled for region anansi,http://org.himssconference.www/,1216843897872 2008-07-23 16:35:19,221 DEBUG org.apache.hadoop.hbase.HRegion: new updates and scanners for region anansi,http://org.himssconference.www/,1216843897872 disabled 2008-07-23 16:35:19,221 DEBUG org.apache.hadoop.hbase.HRegion: no more active scanners for region anansi,http://org.himssconference.www/,1216843897872 2008-07-23 16:35:19,239 DEBUG org.apache.hadoop.hbase.HRegion: no more row locks outstanding on region anansi,http://org.himssconference.www/,1216843897872 2008-07-23 16:35:19,239 DEBUG org.apache.hadoop.hbase.HStore: closed 1854090017/out_edge 2008-07-23 16:35:19,240 DEBUG org.apache.hadoop.hbase.HStore: closed 1854090017/attrib 2008-07-23 16:35:19,240 DEBUG org.apache.hadoop.hbase.HStore: closed 1854090017/in_edge 2008-07-23 16:35:19,240 DEBUG org.apache.hadoop.hbase.HStore: closed 1854090017/update 2008-07-23 16:35:19,240 INFO org.apache.hadoop.hbase.HRegion: closed anansi,http://org.himssconference.www/,1216843897872 2008-07-23 16:35:19,240 INFO org.apache.hadoop.hbase.HRegionServer: aborting server at: 10.1.2.8:60020 2008-07-23 16:35:19,240 INFO org.apache.hadoop.hbase.HRegionServer: regionserver/0.0.0.0:60020 exiting 2008-07-23 16:35:19,585 INFO org.apache.hadoop.hbase.HRegionServer: Starting shutdown thread. 2008-07-23 16:35:19,585 INFO org.apache.hadoop.hbase.HRegionServer: Shutdown thread complete


William Clay Moody
[EMAIL PROTECTED]



Reply via email to