Looks like my conf file wasn't attached, so here go some of (what I thought were) the relevant config values. Obviously, I am not asking anyone to go through every one of them, but can someone cursorily eyeball them to see if something seems off?
But as of now, it looks like I had too many column families in each region..

Thanks,
Vidhya

<property>
  <name>hbase.regionserver.handler.count</name>
  <value>100</value>
  <description>Count of RPC Server instances spun up on RegionServers. The same
  property is used by the HMaster for the count of master handlers. Default is 25.
  </description>
</property>
<property>
  <name>hbase.regionserver.flushlogentries</name>
  <value>100</value>
  <description>Sync the HLog to the HDFS when it has accumulated this many
  entries. Default 100. Value is checked on every HLog.sync.
  </description>
</property>
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
  <description>Maximum size of all memstores in a region server before new
  updates are blocked and flushes are forced. Defaults to 40% of heap.
  </description>
</property>
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.35</value>
  <description>When memstores are being forced to flush to make room in memory,
  keep flushing until we hit this mark. Defaults to 30% of heap. Setting this
  value equal to hbase.regionserver.global.memstore.upperLimit causes the
  minimum possible flushing to occur when updates are blocked due to memstore
  limiting.
  </description>
</property>
<property>
  <name>hbase.regionserver.optionallogflushinterval</name>
  <value>10000</value>
  <description>Sync the HLog to the HDFS after this interval if it has not
  accumulated enough entries to trigger a sync. Default 10 seconds. Units:
  milliseconds.
  </description>
</property>
<property>
  <name>hbase.regionserver.logroll.period</name>
  <value>3600000</value>
  <description>Period at which we will roll the commit log.</description>
</property>
<property>
  <name>hbase.regionserver.thread.splitcompactcheckfrequency</name>
  <value>20000</value>
  <description>How often a region server runs the split/compaction check.
  </description>
</property>
<property>
  <name>hbase.regionserver.nbreservationblocks</name>
  <value>4</value>
  <description>The number of reservation blocks which are used to prevent
  unstable region servers caused by an OOME.
  </description>
</property>
<property>
  <name>hbase.regions.percheckin</name>
  <value>10</value>
  <description>Maximum number of regions that can be assigned in a single go
  to a region server.
  </description>
</property>
<property>
  <name>hbase.server.thread.wakefrequency</name>
  <value>10000</value>
  <description>Time to sleep in between searches for work (in milliseconds).
  Used as the sleep interval by service threads such as the META scanner and
  log roller.
  </description>
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>67108864</value>
  <description>Memstore will be flushed to disk if the size of the memstore
  exceeds this number of bytes. Value is checked by a thread that runs every
  hbase.server.thread.wakefrequency.
  </description>
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value>
  <description>MODIFIED. Block updates if the memstore has
  hbase.hregion.block.memstore times hbase.hregion.flush.size bytes. Useful
  for preventing a runaway memstore during spikes in update traffic. Without
  an upper bound, the memstore fills such that when it flushes, the resultant
  flush files take a long time to compact or split, or worse, we OOME.
  </description>
</property>
<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>128</value>
  <description>Max hlogs you can accumulate before they start rolling (default
  was 32). Hidden parameter!
  </description>
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>268435456</value>
  <description>Maximum HStoreFile size. If any one of a column family's
  HStoreFiles has grown to exceed this value, the hosting HRegion is split in
  two. Default: 256M.
  </description>
</property>
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>3</value>
  <description>If there are more than this number of HStoreFiles in any one
  HStore (one HStoreFile is written per flush of memstore), then a compaction
  is run to rewrite all HStoreFiles as one. Larger numbers put off compaction,
  but when it runs, it takes longer to complete. During a compaction, updates
  cannot be flushed to disk. Long compactions require memory sufficient to
  carry the logging of all updates across the duration of the compaction. If
  too large, clients time out during compaction.
  </description>
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>16</value>
  <description>MODIFIED FROM 4. If there are more than this number of
  StoreFiles in any one Store (one StoreFile is written per flush of MemStore),
  then updates are blocked for this HRegion until a compaction is completed,
  or until hbase.hstore.blockingWaitTime has been exceeded.
  </description>
</property>
<property>
  <name>hbase.hstore.blockingWaitTime</name>
  <value>90000</value>
  <description>The time an HRegion will block updates for after hitting the
  StoreFile limit defined by hbase.hstore.blockingStoreFiles. After this time
  has elapsed, the HRegion will stop blocking updates even if a compaction has
  not been completed. Default: 90 seconds.
  </description>
</property>
<property>
  <name>hbase.hstore.compaction.max</name>
  <value>10</value>
  <description>Max number of HStoreFiles to compact per 'minor' compaction.
  </description>
</property>
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>86400000</value>
  <description>The time (in milliseconds) between 'major' compactions of all
  HStoreFiles in a region. Default: 1 day.
  </description>
</property>
<property>
  <name>hbase.regions.slop</name>
  <value>0.1</value>
  <description>Rebalance if a regionserver has average + (average * slop)
  regions. Default is 10% slop.
  </description>
</property>
<property>
  <name>hfile.min.blocksize.size</name>
  <value>65536</value>
  <description>Minimum store file block size. The smaller you make this, the
  bigger your index and the less you fetch on a random access. Set the size
  down if you have small cells and want faster random access of individual
  cells.
  </description>
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.6</value>
  <description>MODIFIED FROM 0.2. Percentage of maximum heap (-Xmx setting) to
  allocate to the block cache used by HFile/StoreFile. Default of 0.2 means
  allocate 20%. Set to 0 to disable.
  </description>
</property>
<property>
  <name>hbase.client.write.buffer</name>
  <value>2097152</value>
  <description>Size of the write buffer in bytes. A bigger buffer takes more
  memory -- on both the client and server side, since the server instantiates
  the passed write buffer to process it -- but reduces the number of RPCs. For
  an estimate of server-side memory used, evaluate
  hbase.client.write.buffer * hbase.regionserver.handler.count.
  </description>
</property>

On 5/13/10 11:23 AM, "Vidhyashankar Venkataraman" <vidhy...@yahoo-inc.com> wrote:

Thanks for the prompt response.. Oops, forgot the specifics: I ran the whole
thing on five region servers that also run hadoop's data nodes and task
trackers. Each machine has 6 TB of disk space (5 TB available for the data
node and 1 TB for MR and HBase temps), 24 gigs of RAM, and a 3-gig HBase
heap..

How do I give HBase more RAM (are you talking about a config variable)? 3-4
gigs of heap is the max that 32-bit Java can take (or am I wrong?).. I had
synthetically generated the workload, and as far as I know the column sizes
are what I had mentioned..

>> 12 column families is at the extreme regards what we've played with, just
>> FYI.
Ah, ok.. Will alter the schema then..
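As a quick sanity check on the settings above (this is my own sketch, not anything from HBase itself; the 3-gig heap figure comes from this thread, and the RPC-buffer term follows the estimate suggested in the hbase.client.write.buffer description):

```python
# Illustrative heap-budget check for the configuration quoted above.
# hfile.block.cache.size (0.6) and the global memstore upperLimit (0.4)
# are both fractions of the same -Xmx heap, so they can be summed.

heap = 3 * 1024 ** 3            # -Xmx3g region server heap, as in the thread
block_cache = 0.6 * heap        # hfile.block.cache.size = 0.6 (modified from 0.2)
memstore_cap = 0.4 * heap       # hbase.regionserver.global.memstore.upperLimit = 0.4
rpc_buffers = 100 * 2097152     # handler.count * client.write.buffer estimate

committed = block_cache + memstore_cap + rpc_buffers
print(f"block cache + memstore cap alone: {(block_cache + memstore_cap) / heap:.0%} of heap")
print(f"with worst-case RPC buffers: {committed / heap:.0%} of heap")
# At or above 100%, nothing is left for storefile indexes, compactions, or GC slack.
```

With these two fractions alone committing the entire heap, an OOME under load is unsurprising regardless of the schema.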
>> There may also be corruption in one of the storefiles given that the
>> OOME below seems to happen when we try and open a region (but the fact
>> of opening may have no relation to why the OOME).
True, but then, all the region servers crashed at roughly the same time and
for the exact same reason (an OOME when a region was opened)... Was there a
spike in update traffic after the MR job finished? Or was there a compaction
happening by any chance? (Although I don't see an explicit debug message
here; not sure if I had the correct debug log level)...
Vidhya

On 5/13/10 11:05 AM, "Stack" <st...@duboce.net> wrote:

Hello Vidhyashankar:

How many regionservers? What version of hbase and hadoop? How much RAM on
these machines in total? Can you give HBase more RAM?

Also check that you don't have an exceptional cell in your input -- one that
is very much larger than the 14KB you note below.

12 column families is at the extreme regards what we've played with, just
FYI. You might try a schema that has fewer: e.g. one CF for the big cell
value and all the others in a second CF.

There may also be corruption in one of the storefiles given that the OOME
below seems to happen when we try and open a region (but the fact of opening
may have no relation to why the OOME).

St.Ack

On Thu, May 13, 2010 at 10:35 AM, Vidhyashankar Venkataraman
<vidhy...@yahoo-inc.com> wrote:
> This is similar to a mail sent by another user to the group a couple of
> months back.. I am quite new to HBase and I've been trying to conduct a
> basic experiment with it..
>
> I am trying to load 200 million records, each record around 15 KB: with one
> column value around 14 KB and the rest of the 100 column values 8 bytes
> each.. The 120 columns are grouped as 10 qualifiers x 12 families (hope I
> got my jargon right).. Note that only one value is quite large for each doc
> (when compared to the other values)...
> The data is uncompressed.. And each value is uniformly randomly selected..
> I used a map-reduce job to load a data file on hdfs into the database..
> Soon after the job finished, the region servers crash with an OOM
> Exception.. Below is part of the trace from the logs in one of the RS's:
>
> I have attached the conf along with the email: Can you guys point out any
> anomaly in my settings? I have set a heap size of 3 gigs.. Anything
> significantly more, and 32-bit Java doesn't run..
>
> 2010-05-12 19:22:45,068 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=8.43782MB (8847696), Free=1791.2247MB (1878235312), Max=1799.6626MB (1887083008), Counts: Blocks=1, Access=16947, Hit=52, Miss=16895, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.3068389603868127%, Miss Ratio=99.69316124916077%, Evicted/Run=NaN
> 2010-05-12 19:22:45,069 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col5/7617863559659933969, isReference=false, sequence id=2470632548, length=8456716, majorCompaction=false
> 2010-05-12 19:22:45,075 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col6/1328113038200437659, isReference=false, sequence id=2960732840, length=19861, majorCompaction=false
> 2010-05-12 19:22:45,078 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col6/6484804359703635950, isReference=false, sequence id=2470632548, length=8456716, majorCompaction=false
> 2010-05-12 19:22:45,082 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col7/1673569837212457160, isReference=false, sequence id=2960732840, length=19861, majorCompaction=false
> 2010-05-12 19:22:45,085 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col7/4737399093829085995, isReference=false, sequence id=2470632548, length=8456716, majorCompaction=false
> 2010-05-12 19:22:47,238 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col8/8446828932792437464, isReference=false, sequence id=2960732840, length=19861, majorCompaction=false
> 2010-05-12 19:22:47,241 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col8/974386128174268353, isReference=false, sequence id=2470632548, length=8456716, majorCompaction=false
> 2010-05-12 19:22:48,804 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col9/2096232603557969237, isReference=false, sequence id=2470632548, length=8456716, majorCompaction=false
> 2010-05-12 19:22:48,807 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/1651418343/col9/7088206045660348092, isReference=false, sequence id=2960732840, length=19861, majorCompaction=false
> 2010-05-12 19:22:48,808 INFO org.apache.hadoop.hbase.regionserver.HRegion: region DocData,4824176,1273625075099/1651418343 available; sequence id is 2960732841
> 2010-05-12 19:22:48,808 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: DocData,40682172,1273607630618
> 2010-05-12 19:22:48,809 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region DocData,40682172,1273607630618, encoded=271889952
> 2010-05-12 19:22:50,924 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/271889952/CONTENT/4859380626868896307, isReference=false, sequence id=2959849236, length=337563, majorCompaction=false
> 2010-05-12 19:22:53,037 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/271889952/CONTENT/952776139755887312, isReference=false, sequence id=2082553088, length=110460013, majorCompaction=false
> 2010-05-12 19:22:57,404 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded /hbase/DocData/271889952/col1/66449684560689857, isReference=false, sequence id=2959849236, length=12648, majorCompaction=false
> 2010-05-12 19:23:16,165 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening DocData,40682172,1273607630618
> java.lang.OutOfMemoryError: Java heap space
>         at java.io.BufferedInputStream.<init>(BufferedInputStream.java:178)
>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1369)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1626)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>         at org.apache.hadoop.hbase.io.hfile.HFile$FixedFileTrailer.deserialize(HFile.java:1372)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readTrailer(HFile.java:848)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:793)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:273)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.<init>(StoreFile.java:129)
>         at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:410)
>         at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
>         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1549)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:312)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1564)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1531)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1451)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-05-12 19:23:18,246 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: OutOfMemoryError, aborting.
> java.lang.OutOfMemoryError: Java heap space
>         at java.io.BufferedInputStream.<init>(BufferedInputStream.java:178)
>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1369)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1626)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1743)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>         at org.apache.hadoop.hbase.io.hfile.HFile$FixedFileTrailer.deserialize(HFile.java:1372)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readTrailer(HFile.java:848)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.loadFileInfo(HFile.java:793)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:273)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.<init>(StoreFile.java:129)
>         at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:410)
>         at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
>         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1549)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:312)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1564)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1531)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1451)
>         at java.lang.Thread.run(Thread.java:619)
> 2010-05-12 19:23:18,246 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=942, stores=9411, storefiles=19887, storefileIndexSize=182, memstoreSize=0, compactionQueueSize=0, usedHeap=2999, maxHeap=2999, blockCacheSize=8847696, blockCacheFree=1878235312, blockCacheCount=1, blockCacheHitRatio=0, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
> 2010-05-12 19:23:18,247 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
> 2010-05-12 19:23:18,254 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2010-05-12 19:23:18,255 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60020: exiting
> 2010-05-12 19:23:18,255 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: exiting
> 2010-05-12 19:23:18,255 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020: exiting
> 2010-05-12 19:23:18,255 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60020: exiting
> And so on (the region server has a total of 100 handlers)..
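The metrics dump in the log above supports a back-of-envelope reading of where the heap went. This is my own sketch using the numbers from the dump; the per-storefile overhead figure is an assumed illustrative value, not a measured HBase constant:

```python
# Back-of-envelope from the RS metrics dump in the log above.
# per_file_kb is an assumed average overhead for an open store file reader
# (trailer, file info, block index) -- illustrative only.

regions, stores, storefiles = 942, 9411, 19887   # from the metrics dump
heap_mb = 2999                                   # usedHeap == maxHeap in the dump

per_file_kb = 64                                 # assumed per-file overhead
overhead_mb = storefiles * per_file_kb / 1024

print(f"stores per region: {stores / regions:.1f} (roughly one per column family)")
print(f"{storefiles} storefiles * {per_file_kb} KB ~= {overhead_mb:.0f} MB "
      f"of a {heap_mb} MB heap")
```

Nearly ten stores per region is consistent with the 12-column-family schema discussed above, and with ~20k open store files even a modest per-file cost eats a large share of a 3 GB heap, which fits the observation that the heap was fully used (usedHeap == maxHeap) at the moment of the OOME.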