[ 
https://issues.apache.org/jira/browse/HBASE-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13609707#comment-13609707
 ] 

Enis Soztutar commented on HBASE-8143:
--------------------------------------

I was able to reproduce this with a very simple test. By default, 
BlockReaderLocal allocates a 1 MB direct buffer, so having more than 3000 
blocks open results in about 3 GB of direct memory allocation (assuming no 
checksum). 
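
The arithmetic behind that estimate can be sketched as follows. This is only a 
back-of-the-envelope calculation, assuming one buffer per open block reader 
and the 1024 * 1024 default listed in the configuration in this comment; the 
class and method names are hypothetical and are not part of HDFS or HBase. 

```java
// Rough estimate of the direct memory held by short-circuit read buffers.
// Assumption: one buffer per open block reader, no checksum buffer.
final class SsrBufferEstimate {
    // Default value of dfs.client.read.shortcircuit.buffer.size.
    static final long DEFAULT_BUFFER_BYTES = 1024L * 1024L;

    /** Total bytes held if every open reader keeps one buffer of bufferBytes. */
    static long estimateBytes(long openReaders, long bufferBytes) {
        if (openReaders < 0 || bufferBytes < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        // multiplyExact throws ArithmeticException instead of overflowing silently.
        return Math.multiplyExact(openReaders, bufferBytes);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(3000, DEFAULT_BUFFER_BYTES);
        System.out.println(bytes); // prints 3145728000
    }
}
```

That is a little under 3 GiB, which is in line with the growth seen in the 
test. 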

These are the relevant configuration keys and their defaults: 
{code}
  public static final String DFS_CLIENT_READ_SHORTCIRCUIT_KEY =
      "dfs.client.read.shortcircuit";
  public static final boolean DFS_CLIENT_READ_SHORTCIRCUIT_DEFAULT = false;
  public static final String DFS_CLIENT_READ_SHORTCIRCUIT_SKIP_CHECKSUM_KEY =
      "dfs.client.read.shortcircuit.skip.checksum";
  public static final boolean DFS_CLIENT_READ_SHORTCIRCUIT_SKIP_CHECKSUM_DEFAULT = false;
  public static final String DFS_CLIENT_READ_SHORTCIRCUIT_BUFFER_SIZE_KEY =
      "dfs.client.read.shortcircuit.buffer.size";
  public static final int DFS_CLIENT_READ_SHORTCIRCUIT_BUFFER_SIZE_DEFAULT = 1024 * 1024;
{code}

Although the buffers are allocated from a pool that uses weak references, 
HBase keeps the streams open, so the buffers stay allocated and memory usage 
inflates. Neither DFSClient nor BlockReaderLocal guards against allocating 
the buffers. 

Decreasing the buffer size (dfs.client.read.shortcircuit.buffer.size) and 
keeping fewer files open should help in this case. It is not clear that the 
extra buffering in Hadoop 2 helps reads coming from HBase. 
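
As a rough illustration of how a smaller buffer size could be chosen, the 
sketch below picks the largest buffer that keeps a given number of open block 
readers within a direct memory budget. The 512 MiB budget, the power-of-two 
rounding, and the class and method names are all assumptions made for the 
example; I have not checked whether HDFS places other constraints on this 
value. 

```java
// Hypothetical sizing helper for dfs.client.read.shortcircuit.buffer.size.
// Assumption: one buffer per open block reader, no checksum buffer.
final class SsrBufferSizing {
    /**
     * Largest power-of-two buffer size in bytes such that
     * openReaders * size <= budgetBytes, or 0 if even 1 byte does not fit.
     */
    static long maxBufferBytes(long budgetBytes, long openReaders) {
        if (budgetBytes <= 0 || openReaders <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        long perReader = budgetBytes / openReaders;
        // Round down to a power of two (a convenience, not an HDFS requirement).
        return perReader == 0 ? 0 : Long.highestOneBit(perReader);
    }

    public static void main(String[] args) {
        long budget = 512L * 1024L * 1024L; // hypothetical 512 MiB budget
        long size = maxBufferBytes(budget, 3000);
        System.out.println(size); // prints 131072
    }
}
```

With 3000 open readers, a 128 KB buffer would hold about 375 MiB in total 
instead of roughly 3 GB at the default size. 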

                
> HBase on Hadoop 2 with local short circuit reads (ssr) causes OOM 
> ------------------------------------------------------------------
>
>                 Key: HBASE-8143
>                 URL: https://issues.apache.org/jira/browse/HBASE-8143
>             Project: HBase
>          Issue Type: Bug
>          Components: hadoop2
>    Affects Versions: 0.95.0, 0.98.0, 0.94.7
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.95.0, 0.98.0, 0.94.7
>
>
> We've run into an issue with HBase 0.94 on Hadoop 2 with SSR turned on: the 
> memory usage of the HBase process grows to 7g with -Xmx3g after some time, 
> which causes OOM on the RSs. 
> Upon further investigation, I've found that we end up with 200 regions, 
> each having 3-4 store files open. Under Hadoop 2 SSR, BlockReaderLocal 
> allocates DirectBuffers, unlike HDFS 1, where there is no direct buffer 
> allocation. 
> It seems that there are no guards against the memory used by local buffers 
> in HDFS 2, and having a large number of open files causes multiple GB of 
> memory to be consumed by the RS process. 
> This issue is to investigate further what is going on, and whether we can 
> limit the memory usage in HDFS or HBase, and/or document the setup. 
> Possible mitigation scenarios are: 
>  - Turn off SSR for Hadoop 2
>  - Ensure that there is enough unallocated memory for the RS based on 
> expected # of store files
>  - Ensure that there is a lower number of regions per region server (and 
> hence fewer open files)
> Stack trace:
> {code}
> org.apache.hadoop.hbase.DroppedSnapshotException: region: IntegrationTestLoadAndVerify,yC^P\xD7\x945\xD4,1363388517630.24655343d8d356ef708732f34cfe8946.
>         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1560)
>         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1439)
>         at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1380)
>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:449)
>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:215)
>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$500(MemStoreFlusher.java:63)
>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:237)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.OutOfMemoryError: Direct buffer memory
>         at java.nio.Bits.reserveMemory(Bits.java:632)
>         at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:97)
>         at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
>         at org.apache.hadoop.hdfs.util.DirectBufferPool.getBuffer(DirectBufferPool.java:70)
>         at org.apache.hadoop.hdfs.BlockReaderLocal.<init>(BlockReaderLocal.java:315)
>         at org.apache.hadoop.hdfs.BlockReaderLocal.newBlockReader(BlockReaderLocal.java:208)
>         at org.apache.hadoop.hdfs.DFSClient.getLocalBlockReader(DFSClient.java:790)
>         at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:888)
>         at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:455)
>         at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:645)
>         at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:689)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at org.apache.hadoop.hbase.io.hfile.FixedFileTrailer.readFromStream(FixedFileTrailer.java:312)
>         at org.apache.hadoop.hbase.io.hfile.HFile.pickReaderVersion(HFile.java:543)
>         at org.apache.hadoop.hbase.io.hfile.HFile.createReaderWithEncoding(HFile.java:589)
>         at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1261)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:512)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:603)
>         at org.apache.hadoop.hbase.regionserver.Store.validateStoreFile(Store.java:1568)
>         at org.apache.hadoop.hbase.regionserver.Store.commitFile(Store.java:845)
>         at org.apache.hadoop.hbase.regionserver.Store.access$500(Store.java:109)
>         at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.commit(Store.java:2209)
>         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1541)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
