Here are the GC-related parameters:

/usr/java/jdk1.6/bin/java -Xmx4000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode

The heap dump is big:

-rw------- 1 hadoop users 4146551927 Aug 11 03:59 java_pid26972.hprof

Do you have an FTP server where I can upload it?

Thanks

On Tue, Aug 10, 2010 at 9:38 PM, Stack <st...@duboce.net> wrote:

> Ted:
>
> You have 22 column families in your schema? Do you need that many?
> Run with fewer if you can, because 22 CFs takes you into a category that
> not many hang out in. It may be at the root of the OOME.
>
> Otherwise, it's the usual suspects -- a bad record perhaps? One that
> was incorrectly formatted so it had a very large size on it?
>
> Do you run w/ GC enabled? If not, try it. Apparently it's near to
> frictionless. It might give us more clues.
>
> Also, when the RS crashes, it'll dump heap by default. Do you see it?
> If you put it someplace that I can pull, I'll take a look at it.
>
> St.Ack
>
> On Tue, Aug 10, 2010 at 9:30 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > We use 0.20.6 with HBASE-2473
> > As you can see from the following region server log snippet, OOME happened
> > to this RS:
> >
> > 2010-08-11 03:59:12,760 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> > Blocking updates for 'IPC Server handler 17 on 60020' on region
> > 2__HB_NOINC_GRID_0809-THREEGPPSPEECHCALLS-1281499094297,\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E,1281499095128:
> > memstore size 1.0g is >= than blocking 1.0g size
> > 2010-08-11 03:59:16,853 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> > Blocking updates for 'IPC Server handler 24 on 60020' on region
> > 2__HB_NOINC_GRID_0809-THREEGPPSPEECHCALLS-1281499094297,\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E\x0E,1281499095128:
> > memstore size 1.0g is >= than blocking 1.0g size
> > 2010-08-11 03:59:44,524 FATAL
> > org.apache.hadoop.hbase.regionserver.HRegionServer: OutOfMemoryError, aborting.
> > java.lang.OutOfMemoryError: Java heap space
> >     at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
> >     at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:825)
> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:419)
> >     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:318)
> > 2010-08-11 03:59:44,525 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> > request=0.0, regions=9, stores=22, storefiles=4, storefileIndexSize=5,
> > memstoreSize=1502, compactionQueueSize=0, usedHeap=*3929*, maxHeap=3973,
> > blockCacheSize=6836104, blockCacheFree=826362424, blockCacheCount=0,
> > blockCacheHitRatio=0, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
> >
> > Among the other RS, the highest usedHeap is 1750
> >
> > On Sat, Jul 31, 2010 at 3:31 PM, Ryan Rawson <ryano...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> #3 is going to be tricky... due to the ebb and flow of the GC, this value
> >> isn't as accurate as one would wish. Furthermore, we flush memstores based
> >> on RAM pressure.
> >>
> >> Any algorithm would have to have the property of being stable and
> >> conservative... rebalancing is not a 0 impact operation.
> >>
> >> There are jiras open for the rebalance based on load. To date it hasn't
> >> been a practical problem here at SU in our prod clusters however.
> >>
> >> On Jul 31, 2010 3:18 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:
> >> > Hi,
> >> > Currently load balancing only considers region count.
> >> > See ServerManager.getAverageLoad()
> >> >
> >> > I think load balancing should consider the following three factors for
> >> > each RS:
> >> > 1. number of regions it hosts
> >> > 2. number of requests it serves within given period
> >> > 3. how close usedHeap is to maxHeap
> >> >
> >> > Please comment how we should weigh the above three factors in deciding
> >> > the regions to offload from each RS.
> >> >
> >> > Thanks
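
For illustration only, here is one way the three factors from the original proposal could be folded into a single per-RS score. This is a rough sketch in Python for brevity, not HBase code: the names ServerLoad, load_score and should_offload, the weights, and the 20% slop are all hypothetical, and the thread does not settle how the factors should be weighed. It also assumes the other RS has the same maxHeap and region count as the one that crashed, which the thread does not state. The slop margin is one possible way to keep the decision conservative, as asked for above; a single usedHeap sample is still noisy because of GC, so a real balancer would presumably average several samples.

```python
from dataclasses import dataclass


@dataclass
class ServerLoad:
    """Per-RS figures, named after fields in the metrics dump (hypothetical type)."""
    regions: int       # factor 1: number of regions hosted
    requests: float    # factor 2: requests served in the sampling period
    used_heap_mb: int  # usedHeap from the metrics dump
    max_heap_mb: int   # maxHeap from the metrics dump


def load_score(s, avg_regions, avg_requests,
               w_regions=1.0, w_requests=1.0, w_heap=1.0):
    """Weighted sum of the three factors. The weights are placeholders."""
    # Regions and requests are expressed relative to the cluster average;
    # a zero average contributes nothing rather than dividing by zero.
    region_term = s.regions / avg_regions if avg_regions else 0.0
    request_term = s.requests / avg_requests if avg_requests else 0.0
    # Factor 3: how close usedHeap is to maxHeap, as a fraction in [0, 1].
    heap_term = s.used_heap_mb / s.max_heap_mb
    return (w_regions * region_term
            + w_requests * request_term
            + w_heap * heap_term)


def should_offload(score, avg_score, slop=0.2):
    """Only rebalance when a server is well above the average score."""
    return score > avg_score * (1.0 + slop)


if __name__ == "__main__":
    # Figures taken from the log above; the second server's maxHeap and
    # region count are assumptions.
    crashed = ServerLoad(regions=9, requests=0.0,
                         used_heap_mb=3929, max_heap_mb=3973)
    other = ServerLoad(regions=9, requests=0.0,
                       used_heap_mb=1750, max_heap_mb=3973)
    scores = [load_score(s, avg_regions=9, avg_requests=0.0)
              for s in (crashed, other)]
    avg = sum(scores) / len(scores)
    for name, sc in zip(("crashed", "other"), scores):
        print(name, round(sc, 3), should_offload(sc, avg))
```

With equal region counts and no request traffic, the heap term is the only thing separating the two servers, so how heavily it is weighted relative to region count decides whether the crashed RS would have been flagged at all.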