This is whats doing it in:
2008-07-23 16:32:58,652 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 410869ms, ten times longer than scheduled: 10000
2008-07-23 16:32:58,652 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 402421ms, ten times longer than scheduled: 3000
If regionserver doesn't report the master within
hbase.master.lease.period (60 seconds by default), master presumes it
dead and assigns its regions elsewhere. If the regionserver hasn't been
able to signal the master in that time, it shuts itself down to be
safe. You could try upping the default lease periods.
Whats the load on your machines? It looks to be so 'high', java
schedulers not getting any slices. You're running datanodes,
regionservers and tasktrackers all on the one node? What kind of a node
is it? You running RAID by any chance? Is this soon after your MR job
starts? How many mappers are loading HBase at one time? How many
regions do you have? If soon after you start up, perhaps all mappers
are going against a single region. Can you reduce the number of mappers
and see if life is better?
Thanks,
St.Ack
William Clay Moody wrote:
I am having problems with my HRegionServers aborting following rolling
their logs and thus not being able to report to the HMaster for a
while. Eventually all 5 HRegionservers have exited and my MR which is
writing to HBase fails.
Reviewing the logs (with debugging on), I believe what is happening is
the Memcache goes negative and compaction and rolling logs begins.
After this is complete, the logs shows a FATAL error about not being
able to report to the Master and then it exits. It takes about 45 mins
to complete the slow death across all the HRservers
I have pasted the logs of one of the HRegionServers (omitting repeated
log rolling reports, null pointer exception for closed regions and
region closing messages.) Can someone point out what might be the
problem? Thanks in advance,
Clay
2008-07-23 16:20:32,991 WARN org.apache.hadoop.hbase.HRegion: Memcache
size went negative -59248835; resetting
2008-07-23 16:20:33,003 DEBUG org.apache.hadoop.hbase.HRegion:
Finished memcache flush for region
anansi,http://org.himssconference.www/,1216843897872 in 16814ms,
sequence id=106628188
2008-07-23 16:20:33,003 DEBUG org.apache.hadoop.hbase.HRegionServer:
Compaction requested for region:
anansi,http://org.himssconference.www/,1216843897872
2008-07-23 16:20:33,004 INFO org.apache.hadoop.hbase.HRegion: checking
compaction on region anansi,http://org.himssconference.www/,1216843897872
2008-07-23 16:20:33,032 INFO org.apache.hadoop.hbase.HRegion: checking
compaction completed on region
anansi,http://org.himssconference.www/,1216843897872; status: false;
2008-07-23 16:26:09,289 DEBUG org.apache.hadoop.hbase.HLog: Closing
current log writer /hbase/log_10.1.2.8_1216843622025_60020/hlog.dat.060
2008-07-23 16:26:09,298 INFO org.apache.hadoop.hbase.HLog: New log
writer created at /hbase/log_10.1.2.8_1216843622025_60020/hlog.dat.061
2008-07-23 16:26:14,494 INFO org.apache.hadoop.hbase.HRegionServer:
Rolling hlog. Number of entries: 30001
2008-07-23 16:26:14,538 DEBUG org.apache.hadoop.hbase.HLog: Closing
current log writer /hbase/log_10.1.2.8_1216843622025_60020/hlog.dat.061
2008-07-23 16:26:14,548 INFO org.apache.hadoop.hbase.HLog: New log
writer created at /hbase/log_10.1.2.8_1216843622025_60020/hlog.dat.062
< omitted >
2008-07-23 16:32:58,652 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 410869ms, ten times longer than scheduled: 10000
2008-07-23 16:32:58,652 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 402421ms, ten times longer than scheduled: 3000
2008-07-23 16:32:59,060 FATAL org.apache.hadoop.hbase.HRegionServer:
unable to report to master for 402861 milliseconds - aborting server
2008-07-23 16:32:59,452 INFO org.apache.hadoop.hbase.Leases:
regionserver/0.0.0.0:60020 closing leases
2008-07-23 16:33:00,294 INFO
org.apache.hadoop.hbase.Leases$LeaseMonitor:
regionserver/0.0.0.0:60020.leaseChecker exiting
2008-07-23 16:33:00,534 INFO org.apache.hadoop.hbase.Leases:
regionserver/0.0.0.0:60020 closed leases
2008-07-23 16:33:00,534 INFO org.apache.hadoop.ipc.Server: Stopping
server on 60020
2008-07-23 16:33:03,421 INFO org.apache.hadoop.ipc.Server: Stopping
IPC Server listener on 60020
2008-07-23 16:33:03,422 INFO org.apache.hadoop.ipc.Server: Stopping
IPC Server Responder
2008-07-23 16:33:03,437 INFO org.apache.hadoop.hbase.HRegionServer:
Stopping infoServer
2008-07-23 16:33:03,537 INFO org.mortbay.util.ThreadedServer: Stopping
Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=60030]
2008-07-23 16:33:04,114 INFO org.mortbay.http.SocketListener: Stopped
SocketListener on 0.0.0.0:60030
2008-07-23 16:34:59,945 INFO org.mortbay.util.Container: Stopped
HttpContext[/static,/static]
2008-07-23 16:34:59,945 INFO org.apache.hadoop.hbase.HRegionServer:
worker thread exiting
2008-07-23 16:35:15,596 INFO org.apache.hadoop.hbase.HRegionServer:
regionserver/0.0.0.0:60020.compactor exiting
2008-07-23 16:35:15,715 INFO org.mortbay.util.Container: Stopped
HttpContext[/logs,/logs]
2008-07-23 16:35:15,729 INFO org.apache.hadoop.hbase.HRegionServer:
regionserver/0.0.0.0:60020.cacheFlusher exiting
2008-07-23 16:35:15,948 INFO org.mortbay.util.Container: Stopped
[EMAIL PROTECTED]
2008-07-23 16:35:17,528 INFO org.mortbay.util.Container: Stopped
WebApplicationContext[/,/]
2008-07-23 16:35:17,544 INFO org.mortbay.util.Container: Stopped
[EMAIL PROTECTED]
2008-07-23 16:35:17,544 DEBUG org.apache.hadoop.hbase.HLog: closing
log writer in hdfs://namenode:4010/hbase/log_10.1.2.8_1216843622025_60020
2008-07-23 16:35:17,773 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 60020, call
batchUpdate(anansi,http://com.davidhellmann.www/,1216150597773,
9223372036854775807, [EMAIL PROTECTED])
from 10.1.2.7:52761: error: java.io.IOException: Server not running
java.io.IOException: Server not running
at
org.apache.hadoop.hbase.HRegionServer.checkOpen(HRegionServer.java:1758)
at
org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java:1547)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
2008-07-23 16:35:17,802 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 7 on 60020, call
batchUpdate(anansi,http://com.davidhellmann.www/,1216150597773,
9223372036854775807, [EMAIL PROTECTED])
from 10.1.2.8:49245: error: java.io.IOException: Server not running
java.io.IOException: Server not running
at
org.apache.hadoop.hbase.HRegionServer.checkOpen(HRegionServer.java:1758)
at
org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java:1547)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
2
2008-07-23 16:35:18,735 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 7 on 60020: exiting
2008-07-23 16:35:18,735 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 60020 caught: java.lang.NullPointerException
at org.apache.hadoop.ipc.Server.closeConnection(Server.java:976)
at org.apache.hadoop.ipc.Server.access$900(Server.java:66)
at
org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:646)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:659)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:922)
2008-07-23 16:35:18,736 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 60020: exiting
2008-07-23 16:35:18,774 WARN org.apache.hadoop.ipc.Server: IPC Server
Responder, call
batchUpdate(anansi,http://com.davidhellmann.www/,1216150597773,
9223372036854775807, [EMAIL PROTECTED])
from 10.1.2.8:49245: output error
2008-07-23 16:35:18,774 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 60020 caught: java.lang.NullPointerException
at org.apache.hadoop.ipc.Server.closeConnection(Server.java:976)
at org.apache.hadoop.ipc.Server.access$900(Server.java:66)
at
org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:646)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:659)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:922)
2008-07-23 16:35:18,774 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 60020: exiting
2008-07-23 16:35:18,791 WARN org.apache.hadoop.ipc.Server: IPC Server
Responder, call
batchUpdate(anansi,http://com.davidhellmann.www/,1216150597773,
9223372036854775807, [EMAIL PROTECTED])
from 10.1.2.4:43896: output error
2008-07-23 16:35:18,808 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 60020 caught: java.lang.NullPointerException
at org.apache.hadoop.ipc.Server.closeConnection(Server.java:976)
at org.apache.hadoop.ipc.Server.access$900(Server.java:66)
at
org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:646)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:659)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:922)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:48)
at org.apache.hadoop.hbase.HRegionServer.run(HRegionServer.java:873)
at java.lang.Thread.run(Thread.java:619)
2008-07-23 16:35:19,221 DEBUG org.apache.hadoop.hbase.HRegionServer:
closing region anansi,http://org.himssconference.www/,1216843897872
2008-07-23 16:35:19,221 DEBUG org.apache.hadoop.hbase.HRegion:
compactions and cache flushes disabled for region
anansi,http://org.himssconference.www/,1216843897872
2008-07-23 16:35:19,221 DEBUG org.apache.hadoop.hbase.HRegion: new
updates and scanners for region
anansi,http://org.himssconference.www/,1216843897872 disabled
2008-07-23 16:35:19,221 DEBUG org.apache.hadoop.hbase.HRegion: no more
active scanners for region
anansi,http://org.himssconference.www/,1216843897872
2008-07-23 16:35:19,239 DEBUG org.apache.hadoop.hbase.HRegion: no more
row locks outstanding on region
anansi,http://org.himssconference.www/,1216843897872
2008-07-23 16:35:19,239 DEBUG org.apache.hadoop.hbase.HStore: closed
1854090017/out_edge
2008-07-23 16:35:19,240 DEBUG org.apache.hadoop.hbase.HStore: closed
1854090017/attrib
2008-07-23 16:35:19,240 DEBUG org.apache.hadoop.hbase.HStore: closed
1854090017/in_edge
2008-07-23 16:35:19,240 DEBUG org.apache.hadoop.hbase.HStore: closed
1854090017/update
2008-07-23 16:35:19,240 INFO org.apache.hadoop.hbase.HRegion: closed
anansi,http://org.himssconference.www/,1216843897872
2008-07-23 16:35:19,240 INFO org.apache.hadoop.hbase.HRegionServer:
aborting server at: 10.1.2.8:60020
2008-07-23 16:35:19,240 INFO org.apache.hadoop.hbase.HRegionServer:
regionserver/0.0.0.0:60020 exiting
2008-07-23 16:35:19,585 INFO org.apache.hadoop.hbase.HRegionServer:
Starting shutdown thread.
2008-07-23 16:35:19,585 INFO org.apache.hadoop.hbase.HRegionServer:
Shutdown thread complete
William Clay Moody
[EMAIL PROTECTED]