Looks like my conf file wasn't attached, so here go some of (what I thought were) the relevant config values. Obviously, I am not asking anyone to go through every one of them, but can someone cursorily eyeball them to see if something seems off?
But as of now, it looks like I had too many column families in each region..

Thanks,
Vidhya

<property>
  <name>hbase.regionserver.handler.count</name>
  <value>100</value>
  <description>Count of RPC Server instances spun up on RegionServers. The same
  property is used by the HMaster for the count of master handlers. Default is 25.
  </description>
</property>
<property>
  <name>hbase.regionserver.flushlogentries</name>
  <value>100</value>
  <description>Sync the HLog to the HDFS when it has accumulated this many
  entries. Default 100. Value is checked on every HLog.sync.
  </description>
</property>
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
  <description>Maximum size of all memstores in a region server before new
  updates are blocked and flushes are forced. Defaults to 40% of heap.
  </description>
</property>
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.35</value>
  <description>When memstores are being forced to flush to make room in memory,
  keep flushing until we hit this mark. Defaults to 30% of heap. Setting this
  value equal to hbase.regionserver.global.memstore.upperLimit causes the
  minimum possible flushing to occur when updates are blocked due to memstore
  limiting.
  </description>
</property>
<property>
  <name>hbase.regionserver.optionallogflushinterval</name>
  <value>10000</value>
  <description>Sync the HLog to the HDFS after this interval if it has not
  accumulated enough entries to trigger a sync. Default 10 seconds. Units:
  milliseconds.
  </description>
</property>
<property>
  <name>hbase.regionserver.logroll.period</name>
  <value>3600000</value>
  <description>Period at which we will roll the commit log.</description>
</property>
<property>
  <name>hbase.regionserver.thread.splitcompactcheckfrequency</name>
  <value>20000</value>
  <description>How often a region server runs the split/compaction check.
  </description>
</property>
<property>
  <name>hbase.regionserver.nbreservationblocks</name>
  <value>4</value>
  <description>The number of reservation blocks which are used to prevent
  unstable region servers caused by an OOME.
  </description>
</property>
<property>
  <name>hbase.regions.percheckin</name>
  <value>10</value>
  <description>Maximum number of regions that can be assigned in a single go
  to a region server.
  </description>
</property>
<property>
  <name>hbase.server.thread.wakefrequency</name>
  <value>10000</value>
  <description>Time to sleep in between searches for work (in milliseconds).
  Used as the sleep interval by service threads such as the META scanner and
  log roller.
  </description>
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>67108864</value>
  <description>Memstore will be flushed to disk if the size of the memstore
  exceeds this number of bytes. Value is checked by a thread that runs every
  hbase.server.thread.wakefrequency.
  </description>
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value>
  <description>MODIFIED. Block updates if the memstore has
  hbase.hregion.block.memstore times hbase.hregion.flush.size bytes. Useful
  for preventing a runaway memstore during spikes in update traffic. Without
  an upper bound, the memstore fills such that when it flushes, the resultant
  flush files take a long time to compact or split, or worse, we OOME.
  </description>
</property>
<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>128</value>
  <description>Max hlogs you can accumulate before they start rolling (default
  was 32). Hidden parameter!
  </description>
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>268435456</value>
  <description>Maximum HStoreFile size. If any one of a column family's
  HStoreFiles has grown to exceed this value, the hosting HRegion is split in
  two. Default: 256M.
  </description>
</property>
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>3</value>
  <description>If there are more than this number of HStoreFiles in any one
  HStore (one HStoreFile is written per flush of memstore), then a compaction
  is run to rewrite all HStoreFiles as one. Larger numbers put off compaction,
  but when it runs, it takes longer to complete. During a compaction, updates
  cannot be flushed to disk. Long compactions require memory sufficient to
  carry the logging of all updates across the duration of the compaction. If
  too large, clients time out during compaction.
  </description>
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>16</value>
  <description>MODIFIED FROM 4. If there are more than this number of
  StoreFiles in any one Store (one StoreFile is written per flush of MemStore),
  then updates are blocked for this HRegion until a compaction is completed,
  or until hbase.hstore.blockingWaitTime has been exceeded.
  </description>
</property>
<property>
  <name>hbase.hstore.blockingWaitTime</name>
  <value>90000</value>
  <description>The time an HRegion will block updates for after hitting the
  StoreFile limit defined by hbase.hstore.blockingStoreFiles. After this time
  has elapsed, the HRegion will stop blocking updates even if a compaction has
  not been completed. Default: 90 seconds.
  </description>
</property>
<property>
  <name>hbase.hstore.compaction.max</name>
  <value>10</value>
  <description>Max number of HStoreFiles to compact per 'minor' compaction.
  </description>
</property>
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>86400000</value>
  <description>The time (in milliseconds) between 'major' compactions of all
  HStoreFiles in a region. Default: 1 day.
  </description>
</property>
<property>
  <name>hbase.regions.slop</name>
  <value>0.1</value>
  <description>Rebalance if a regionserver has average + (average * slop)
  regions. Default is 10% slop.
  </description>
</property>
<property>
  <name>hfile.min.blocksize.size</name>
  <value>65536</value>
  <description>Minimum store file block size. The smaller you make this, the
  bigger your index and the less you fetch on a random access. Set the size
  down if you have small cells and want faster random access of individual
  cells.
  </description>
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.6</value>
  <description>MODIFIED FROM 0.2. Percentage of maximum heap (-Xmx setting) to
  allocate to the block cache used by HFile/StoreFile. Default of 0.2 means
  allocate 20%. Set to 0 to disable.
  </description>
</property>
<property>
  <name>hbase.client.write.buffer</name>
  <value>2097152</value>
  <description>Size of the write buffer in bytes. A bigger buffer takes more
  memory -- on both the client and server side, since the server instantiates
  the passed write buffer to process it -- but reduces the number of RPCs. For
  an estimate of server-side memory used, evaluate
  hbase.client.write.buffer * hbase.regionserver.handler.count.
  </description>
</property>

On 5/13/10 11:23 AM, "Vidhyashankar Venkataraman" <vidhy...@yahoo-inc.com> wrote:

Thanks for the prompt response.. Oops, forgot the specifics: I ran the whole
thing on five region servers that also run hadoop's data nodes and task
trackers. Each machine has 6 TB of disk space (5 TB available for the data
node and 1 TB for MR and HBase temps), 24 gigs of RAM, and a 3-gig HBase
heap..

How do I give HBase more RAM (are you talking about a config variable)? 3-4
gigs of heap is the max that 32-bit Java can take (or am I wrong?).. I had
synthetically generated the workload, and as far as I know the column sizes
are what I had mentioned..

>> 12 column families is at the extreme regards what we've played with, just
>> FYI.
Ah, ok.. Will alter the schema then..
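As a quick sanity check on the settings above (this is my own sketch, not anything from HBase itself; the 3-gig heap figure comes from this thread, and the RPC-buffer term follows the estimate suggested in the hbase.client.write.buffer description):

```python
# Illustrative heap-budget check for the configuration quoted above.
# hfile.block.cache.size (0.6) and the global memstore upperLimit (0.4)
# are both fractions of the same -Xmx heap, so they can be summed.

heap = 3 * 1024 ** 3            # -Xmx3g region server heap, as in the thread
block_cache = 0.6 * heap        # hfile.block.cache.size = 0.6 (modified from 0.2)
memstore_cap = 0.4 * heap       # hbase.regionserver.global.memstore.upperLimit = 0.4
rpc_buffers = 100 * 2097152     # handler.count * client.write.buffer estimate

committed = block_cache + memstore_cap + rpc_buffers
print(f"block cache + memstore cap alone: {(block_cache + memstore_cap) / heap:.0%} of heap")
print(f"with worst-case RPC buffers: {committed / heap:.0%} of heap")
# At or above 100%, nothing is left for storefile indexes, compactions, or GC slack.
```

With these two fractions alone committing the entire heap, an OOME under load is unsurprising regardless of the schema.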
>> There may also be corruption in one of the storefiles given that the
>> OOME below seems to happen when we try and open a region (but the fact
>> of opening may have no relation to why the OOME).
True, but then, all the region servers crashed at roughly the same time and
for the exact same reason (an OOME when a region was opened)... Was there a
spike in update traffic after the MR job finished? Or was there a compaction
happening by any chance? (Although I don't see an explicit debug message
here; not sure if I had the correct debug log level)...
Vidhya

On 5/13/10 11:05 AM, "Stack" <st...@duboce.net> wrote:

Hello Vidhyashankar:

How many regionservers? What version of hbase and hadoop? How much RAM on
these machines in total? Can you give HBase more RAM?

Also check that you don't have an exceptional cell in your input -- one that
is very much larger than the 14KB you note below.

12 column families is at the extreme regards what we've played with, just
FYI. You might try a schema that has fewer: e.g. one CF for the big cell
value and all the others in a second CF.

There may also be corruption in one of the storefiles given that the OOME
below seems to happen when we try and open a region (but the fact of opening
may have no relation to why the OOME).

St.Ack

On Thu, May 13, 2010 at 10:35 AM, Vidhyashankar Venkataraman
<vidhy...@yahoo-inc.com> wrote:
> This is similar to a mail sent by another user to the group a couple of
> months back.. I am quite new to HBase and I've been trying to conduct a
> basic experiment with it..
>
> I am trying to load 200 million records, each record around 15 KB: with one
> column value around 14 KB and the rest of the 100 column values 8 bytes
> each.. The 120 columns are grouped as 10 qualifiers x 12 families (hope I
> got my jargon right).. Note that only one value is quite large for each doc
> (when compared to the other values)...
> The data is uncompressed.. And each value is uniformly randomly selected..
> I used a map-reduce job to load a data file on hdfs into the database..
> Soon after the job finished, the region servers crash with an OOM
> Exception.. Below is part of the trace from the logs in one of the RS's:
>
> I have attached the conf along with the email: Can you guys point out any
> anomaly in my settings? I have set a heap size of 3 gigs.. Anything
> significantly more, and 32-bit Java doesn't run..
>
> 2010-05-12 19:22:45,068 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=8.43782MB (8847696), Free=1791.2247MB (1878235312), Max=1799.6626MB (1887083008), Counts: Blocks=1, Access=16947, Hit=52, Miss=16895, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.3068389603868127%, Miss Ratio=99.69316124916077%, Evicted/Run=NaN
> 2010-05-12 19:22:45,069 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col5/7617863559659933969, isReference=false, sequence id=2470632548, length=8456716, majorCompaction=false
> 2010-05-12 19:22:45,075 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col6/1328113038200437659, isReference=false, sequence id=2960732840, length=19861, majorCompaction=false
> 2010-05-12 19:22:45,078 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col6/6484804359703635950, isReference=false, sequence id=2470632548, length=8456716, majorCompaction=false
> 2010-05-12 19:22:45,082 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col7/1673569837212457160, isReference=false, sequence id=2960732840, length=19861, majorCompaction=false
> 2010-05-12 19:22:45,085 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col7/4737399093829085995, isReference=false, sequence id=2470632548, length=8456716, majorCompaction=false
> 2010-05-12 19:22:47,238 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col8/8446828932792437464, isReference=false, sequence id=2960732840, length=19861, majorCompaction=false
> 2010-05-12 19:22:47,241 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col8/974386128174268353, isReference=false, sequence id=2470632548, length=8456716, majorCompaction=false
> 2010-05-12 19:22:48,804 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col9/2096232603557969237, isReference=false, sequence id=2470632548, length=8456716, majorCompaction=false
> 2010-05-12 19:22:48,807 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col9/7088206045660348092, isReference=false, sequence id=2960732840, length=19861, majorCompaction=false
> 2010-05-12 19:22:48,808 INFO org.apache.hadoop.hbase.regionserver.HRegion: region DocData,4824176,1273625075099/1651418343 available; sequence id is 2960732841
> 2010-05-12 19:22:48,808 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: DocData,40682172,1273607630618
> 2010-05-12 19:22:48,809 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region DocData,40682172,1273607630618, encoded=271889952
> 2010-05-12 19:22:50,924 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/271889952/CONTENT/4859380626868896307, isReference=false, sequence id=2959849236, length=337563, majorCompaction=false
> 2010-05-12 19:22:53,037 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/271889952/CONTENT/952776139755887312, isReference=false, sequence id=2082553088, length=110460013, majorCompaction=false
> 2010-05-12 19:22:57,404 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/271889952/col1/66449684560689857, isReference=false, sequence id=2959849236, length=12648, majorCompaction=false
> 2010-05-12 19:23:16,165 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening DocData,40682172,1273607630618
> java.lang.OutOfMemoryError: Java heap space
>         at java.io.BufferedInputStream.<init>(BufferedInputStream.java:178)
>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1369)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1626)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>         at org.apache.hadoop.hbase.io.hfile.HFile$FixedFileTrailer.deserialize(HFile.java:1372)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readTrailer(HFile.java:848)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:793)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:273)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.<init>(StoreFile.java:129)
>         at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:410)
>         at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
>         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1549)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:312)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1564)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1531)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1451)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-05-12 19:23:18,246 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: OutOfMemoryError, aborting.
> java.lang.OutOfMemoryError: Java heap space
>         at java.io.BufferedInputStream.<init>(BufferedInputStream.java:178)
>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1369)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1626)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>         at org.apache.hadoop.hbase.io.hfile.HFile$FixedFileTrailer.deserialize(HFile.java:1372)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readTrailer(HFile.java:848)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:793)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:273)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.<init>(StoreFile.java:129)
>         at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:410)
>         at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
>         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1549)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:312)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1564)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1531)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1451)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-05-12 19:23:18,246 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=942, stores=9411, storefiles=19887, storefileIndexSize=182, memstoreSize=0, compactionQueueSize=0, usedHeap=2999, maxHeap=2999, blockCacheSize=8847696, blockCacheFree=1878235312, blockCacheCount=1, blockCacheHitRatio=0, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
> 2010-05-12 19:23:18,247 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
> 2010-05-12 19:23:18,254 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2010-05-12 19:23:18,255 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60020: exiting
> 2010-05-12 19:23:18,255 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: exiting
> 2010-05-12 19:23:18,255 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020: exiting
> 2010-05-12 19:23:18,255 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60020: exiting
> And so on (the region server has a total of 100 handlers)..
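The metrics dump in the log above supports a back-of-envelope reading of where the heap went. This is my own sketch using the numbers from the dump; the per-storefile overhead figure is an assumed illustrative value, not a measured HBase constant:

```python
# Back-of-envelope from the RS metrics dump in the log above.
# per_file_kb is an assumed average overhead for an open store file reader
# (trailer, file info, block index) -- illustrative only.

regions, stores, storefiles = 942, 9411, 19887   # from the metrics dump
heap_mb = 2999                                   # usedHeap == maxHeap in the dump

per_file_kb = 64                                 # assumed per-file overhead
overhead_mb = storefiles * per_file_kb / 1024

print(f"stores per region: {stores / regions:.1f} (roughly one per column family)")
print(f"{storefiles} storefiles * {per_file_kb} KB ~= {overhead_mb:.0f} MB "
      f"of a {heap_mb} MB heap")
```

Nearly ten stores per region is consistent with the 12-column-family schema discussed above, and with ~20k open store files even a modest per-file cost eats a large share of a 3 GB heap, which fits the observation that the heap was fully used (usedHeap == maxHeap) at the moment of the OOME.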