Thanks StAck,
From the documentation, I felt this might be the problem.
My 5 regionservers are also the 5 TaskTrackers and Datanodes for the
cluster. A separate node is the NameNode, HMaster and JobTracker.
We are running 10 maps simultaneously and most of the commits to HBase
are going to the same subset of regions, generally on the same HRS. We
have a total of 30 regions. The fault was occurring pretty early into
the reduce phase of the reducer (post copy and sort). All the maps
complete successfully (input data is not from HBase).
Here is the MR stats at failure:
Kind % Complete Num Tasks Pending Running Complete Killed Failed/
Killed Task Attempts
map 100.00% 51 0 0
51 0 0 / 5
reduce 85.49% 5 0 0
0 5 15 / 4
The machines that make up the cluster are: AMD Athlon64 3000+ 2GHZ,
with 1GB PC3200 DDR running Ubuntu Server 64-bit. Hadoop 0.16.4 and
HBase 0.1.3.
I have a feeling I might be trying to do too much with too little,
especially considering data input jobs create very hot regions due to
the nature of the data.
I will try to increase the default lease period and spread out the data.
Thanks again
Clay
William Clay Moody
[EMAIL PROTECTED]
From stack <[EMAIL PROTECTED]>
Subject Re: HRegionservers aborting after rolling logs
Date Wed, 23 Jul 2008 22:29:52 GMT
This is whats doing it in:
2008-07-23 16:32:58,652 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 410869ms, ten times longer than scheduled: 10000
2008-07-23 16:32:58,652 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 402421ms, ten times longer than scheduled: 3000
If regionserver doesn't report the master within
hbase.master.lease.period (60 seconds by default), master presumes it
dead and assigns its regions elsewhere. If the regionserver hasn't been
able to signal the master in that time, it shuts itself down to be
safe. You could try upping the default lease periods.
Whats the load on your machines? It looks to be so 'high', java
schedulers not getting any slices. You're running datanodes,
regionservers and tasktrackers all on the one node? What kind of a node
is it? You running RAID by any chance? Is this soon after your MR job
starts? How many mappers are loading HBase at one time? How many
regions do you have? If soon after you start up, perhaps all mappers
are going against a single region. Can you reduce the number of mappers
and see if life is better?
Thanks,
St.Ack
William Clay Moody wrote:
> I am having problems with my HRegionServers aborting following
rolling
> their logs and thus not being able to report to the HMaster for a
> while. Eventually all 5 HRegionservers have exited and my MR which is
> writing to HBase fails.
>
> Reviewing the logs (with debugging on), I believe what is happening
is
> the Memcache goes negative and compaction and rolling logs begins.
> After this is complete, the logs shows a FATAL error about not being
> able to report to the Master and then it exits. It takes about 45
mins
> to complete the slow death across all the HRservers
>
> I have pasted the logs of one of the HRegionServers (omitting
repeated
> log rolling reports, null pointer exception for closed regions and
> region closing messages.) Can someone point out what might be the
> problem? Thanks in advance,
>
> Clay
>
>
>
> 2008-07-23 16:20:32,991 WARN org.apache.hadoop.hbase.HRegion:
Memcache
> size went negative -59248835; resetting
> 2008-07-23 16:20:33,003 DEBUG org.apache.hadoop.hbase.HRegion:
> Finished memcache flush for region
> anansi,http://org.himssconference.www/,1216843897872 in 16814ms,
> sequence id=106628188
> 2008-07-23 16:20:33,003 DEBUG org.apache.hadoop.hbase.HRegionServer:
> Compaction requested for region:
> anansi,http://org.himssconference.www/,1216843897872
> 2008-07-23 16:20:33,004 INFO org.apache.hadoop.hbase.HRegion:
checking
> compaction on region anansi,http://org.himssconference.www/,1216843897872
> 2008-07-23 16:20:33,032 INFO org.apache.hadoop.hbase.HRegion:
checking
> compaction completed on region
> anansi,http://org.himssconference.www/,1216843897872; status: false;
> 2008-07-23 16:26:09,289 DEBUG org.apache.hadoop.hbase.HLog: Closing
> current log writer /hbase/log_10.1.2.8_1216843622025_60020/hlog.dat.
060
> 2008-07-23 16:26:09,298 INFO org.apache.hadoop.hbase.HLog: New log
> writer created at /hbase/log_10.1.2.8_1216843622025_60020/hlog.dat.
061
> 2008-07-23 16:26:14,494 INFO org.apache.hadoop.hbase.HRegionServer:
> Rolling hlog. Number of entries: 30001
> 2008-07-23 16:26:14,538 DEBUG org.apache.hadoop.hbase.HLog: Closing
> current log writer /hbase/log_10.1.2.8_1216843622025_60020/hlog.dat.
061
> 2008-07-23 16:26:14,548 INFO org.apache.hadoop.hbase.HLog: New log
> writer created at /hbase/log_10.1.2.8_1216843622025_60020/hlog.dat.
062
> < omitted >
> 2008-07-23 16:32:58,652 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 410869ms, ten times longer than scheduled: 10000
> 2008-07-23 16:32:58,652 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 402421ms, ten times longer than scheduled: 3000
> 2008-07-23 16:32:59,060 FATAL org.apache.hadoop.hbase.HRegionServer:
> unable to report to master for 402861 milliseconds - aborting server
> 2008-07-23 16:32:59,452 INFO org.apache.hadoop.hbase.Leases:
> regionserver/0.0.0.0:60020 closing leases
> 2008-07-23 16:33:00,294 INFO
> org.apache.hadoop.hbase.Leases$LeaseMonitor:
> regionserver/0.0.0.0:60020.leaseChecker exiting
> 2008-07-23 16:33:00,534 INFO org.apache.hadoop.hbase.Leases:
> regionserver/0.0.0.0:60020 closed leases
> 2008-07-23 16:33:00,534 INFO org.apache.hadoop.ipc.Server: Stopping
> server on 60020
> 2008-07-23 16:33:03,421 INFO org.apache.hadoop.ipc.Server: Stopping
> IPC Server listener on 60020
> 2008-07-23 16:33:03,422 INFO org.apache.hadoop.ipc.Server: Stopping
> IPC Server Responder
> 2008-07-23 16:33:03,437 INFO org.apache.hadoop.hbase.HRegionServer:
> Stopping infoServer
> 2008-07-23 16:33:03,537 INFO org.mortbay.util.ThreadedServer:
Stopping
> Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=60030]
> 2008-07-23 16:33:04,114 INFO org.mortbay.http.SocketListener: Stopped
> SocketListener on 0.0.0.0:60030
> 2008-07-23 16:34:59,945 INFO org.mortbay.util.Container: Stopped
> HttpContext[/static,/static]
> 2008-07-23 16:34:59,945 INFO org.apache.hadoop.hbase.HRegionServer:
> worker thread exiting
> 2008-07-23 16:35:15,596 INFO org.apache.hadoop.hbase.HRegionServer:
> regionserver/0.0.0.0:60020.compactor exiting
> 2008-07-23 16:35:15,715 INFO org.mortbay.util.Container: Stopped
> HttpContext[/logs,/logs]
> 2008-07-23 16:35:15,729 INFO org.apache.hadoop.hbase.HRegionServer:
> regionserver/0.0.0.0:60020.cacheFlusher exiting
> 2008-07-23 16:35:15,948 INFO org.mortbay.util.Container: Stopped
> [EMAIL PROTECTED]
> 2008-07-23 16:35:17,528 INFO org.mortbay.util.Container: Stopped
> WebApplicationContext[/,/]
> 2008-07-23 16:35:17,544 INFO org.mortbay.util.Container: Stopped
> [EMAIL PROTECTED]
> 2008-07-23 16:35:17,544 DEBUG org.apache.hadoop.hbase.HLog: closing
> log writer in hdfs://namenode:4010/hbase/
log_10.1.2.8_1216843622025_60020
> 2008-07-23 16:35:17,773 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 0 on 60020, call
> batchUpdate(anansi,http://com.davidhellmann.www/,1216150597773,
> 9223372036854775807, [EMAIL PROTECTED])
> from 10.1.2.7:52761: error: java.io.IOException: Server not running
> java.io.IOException: Server not running
> at
> org.apache.hadoop.hbase.HRegionServer.checkOpen(HRegionServer.java:
1758)
> at
>
org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java:
1547)
>
> at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
> at
>
sun
.reflect
.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:
25)
>
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
> 2008-07-23 16:35:17,802 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 7 on 60020, call
> batchUpdate(anansi,http://com.davidhellmann.www/,1216150597773,
> 9223372036854775807, [EMAIL PROTECTED])
> from 10.1.2.8:49245: error: java.io.IOException: Server not running
> java.io.IOException: Server not running
> at
> org.apache.hadoop.hbase.HRegionServer.checkOpen(HRegionServer.java:
1758)
> at
>
org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java:
1547)
>
> at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
> at
>
sun
.reflect
.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:
25)
>
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
> 2
>
> 2008-07-23 16:35:18,735 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 7 on 60020: exiting
> 2008-07-23 16:35:18,735 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 4 on 60020 caught: java.lang.NullPointerException
> at org.apache.hadoop.ipc.Server.closeConnection(Server.java:976)
> at org.apache.hadoop.ipc.Server.access$900(Server.java:66)
> at
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:
646)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:
659)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:922)
>
> 2008-07-23 16:35:18,736 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 4 on 60020: exiting
> 2008-07-23 16:35:18,774 WARN org.apache.hadoop.ipc.Server: IPC Server
> Responder, call
> batchUpdate(anansi,http://com.davidhellmann.www/,1216150597773,
> 9223372036854775807, [EMAIL PROTECTED])
> from 10.1.2.8:49245: output error
> 2008-07-23 16:35:18,774 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 9 on 60020 caught: java.lang.NullPointerException
> at org.apache.hadoop.ipc.Server.closeConnection(Server.java:976)
> at org.apache.hadoop.ipc.Server.access$900(Server.java:66)
> at
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:
646)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:
659)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:922)
>
> 2008-07-23 16:35:18,774 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 9 on 60020: exiting
> 2008-07-23 16:35:18,791 WARN org.apache.hadoop.ipc.Server: IPC Server
> Responder, call
> batchUpdate(anansi,http://com.davidhellmann.www/,1216150597773,
> 9223372036854775807, [EMAIL PROTECTED])
> from 10.1.2.4:43896: output error
> 2008-07-23 16:35:18,808 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 3 on 60020 caught: java.lang.NullPointerException
> at org.apache.hadoop.ipc.Server.closeConnection(Server.java:976)
> at org.apache.hadoop.ipc.Server.access$900(Server.java:66)
> at
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:
646)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:
659)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:922)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
>
sun
.reflect
.NativeConstructorAccessorImpl
.newInstance(NativeConstructorAccessorImpl.java:39)
>
> at
>
sun
.reflect
.DelegatingConstructorAccessorImpl
.newInstance(DelegatingConstructorAccessorImpl.java:27)
>
> at java.lang.reflect.Constructor.newInstance(Constructor.java:
513)
> at
>
org
.apache
.hadoop
.hbase
.RemoteExceptionHandler
.decodeRemoteException(RemoteExceptionHandler.java:82)
>
> at
>
org
.apache
.hadoop
.hbase
.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:48)
>
> at org.apache.hadoop.hbase.HRegionServer.run(HRegionServer.java:
873)
> at java.lang.Thread.run(Thread.java:619)
>
>
> 2008-07-23 16:35:19,221 DEBUG org.apache.hadoop.hbase.HRegionServer:
> closing region anansi,http://org.himssconference.www/,1216843897872
> 2008-07-23 16:35:19,221 DEBUG org.apache.hadoop.hbase.HRegion:
> compactions and cache flushes disabled for region
> anansi,http://org.himssconference.www/,1216843897872
> 2008-07-23 16:35:19,221 DEBUG org.apache.hadoop.hbase.HRegion: new
> updates and scanners for region
> anansi,http://org.himssconference.www/,1216843897872 disabled
> 2008-07-23 16:35:19,221 DEBUG org.apache.hadoop.hbase.HRegion: no
more
> active scanners for region
> anansi,http://org.himssconference.www/,1216843897872
> 2008-07-23 16:35:19,239 DEBUG org.apache.hadoop.hbase.HRegion: no
more
> row locks outstanding on region
> anansi,http://org.himssconference.www/,1216843897872
> 2008-07-23 16:35:19,239 DEBUG org.apache.hadoop.hbase.HStore: closed
> 1854090017/out_edge
> 2008-07-23 16:35:19,240 DEBUG org.apache.hadoop.hbase.HStore: closed
> 1854090017/attrib
> 2008-07-23 16:35:19,240 DEBUG org.apache.hadoop.hbase.HStore: closed
> 1854090017/in_edge
> 2008-07-23 16:35:19,240 DEBUG org.apache.hadoop.hbase.HStore: closed
> 1854090017/update
> 2008-07-23 16:35:19,240 INFO org.apache.hadoop.hbase.HRegion: closed
> anansi,http://org.himssconference.www/,1216843897872
> 2008-07-23 16:35:19,240 INFO org.apache.hadoop.hbase.HRegionServer:
> aborting server at: 10.1.2.8:60020
> 2008-07-23 16:35:19,240 INFO org.apache.hadoop.hbase.HRegionServer:
> regionserver/0.0.0.0:60020 exiting
> 2008-07-23 16:35:19,585 INFO org.apache.hadoop.hbase.HRegionServer:
> Starting shutdown thread.
> 2008-07-23 16:35:19,585 INFO org.apache.hadoop.hbase.HRegionServer:
> Shutdown thread complete
>
>
> William Clay Moody
> [EMAIL PROTECTED]
>
>
>