[
https://issues.apache.org/jira/browse/HBASE-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574986#comment-16574986
]
Mike Drob commented on HBASE-21031:
-----------------------------------
Very interesting failure scenario, Allan. Great job diagnosing it.
I think I have some feedback on the patch. Would you mind uploading it to Review Board?
> Memory leak if replay edits failed during region opening
> --------------------------------------------------------
>
> Key: HBASE-21031
> URL: https://issues.apache.org/jira/browse/HBASE-21031
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.1.0, 2.0.1
> Reporter: Allan Yang
> Assignee: Allan Yang
> Priority: Major
> Attachments: HBASE-21031.branch-2.0.001.patch, memoryleak.png
>
>
> Due to HBASE-21029, when replaying edits that contain many identical cells, the
> memstore won't flush, and an exception is thrown once all heap space is used:
> {code}
> 2018-08-06 15:52:27,590 ERROR [RS_OPEN_REGION-regionserver/hb-bp10cw4ejoy0a2f3f-009:16020-2] handler.OpenRegionHandler(302): Failed open of region=hbase_test,dffa78,1531227033378.cbf9a2daf3aaa0c7e931e9c9a7b53f41., starting to roll back the global memstore size.
> java.lang.OutOfMemoryError: Java heap space
>     at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
>     at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
>     at org.apache.hadoop.hbase.regionserver.OnheapChunk.allocateDataBuffer(OnheapChunk.java:41)
>     at org.apache.hadoop.hbase.regionserver.Chunk.init(Chunk.java:104)
>     at org.apache.hadoop.hbase.regionserver.ChunkCreator.getChunk(ChunkCreator.java:226)
>     at org.apache.hadoop.hbase.regionserver.ChunkCreator.getChunk(ChunkCreator.java:180)
>     at org.apache.hadoop.hbase.regionserver.ChunkCreator.getChunk(ChunkCreator.java:163)
>     at org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.getOrMakeChunk(MemStoreLABImpl.java:273)
>     at org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.copyCellInto(MemStoreLABImpl.java:148)
>     at org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.copyCellInto(MemStoreLABImpl.java:111)
>     at org.apache.hadoop.hbase.regionserver.Segment.maybeCloneWithAllocator(Segment.java:178)
>     at org.apache.hadoop.hbase.regionserver.AbstractMemStore.maybeCloneWithAllocator(AbstractMemStore.java:287)
>     at org.apache.hadoop.hbase.regionserver.AbstractMemStore.add(AbstractMemStore.java:107)
>     at org.apache.hadoop.hbase.regionserver.HStore.add(HStore.java:706)
>     at org.apache.hadoop.hbase.regionserver.HRegion.restoreEdit(HRegion.java:5494)
>     at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:4608)
>     at org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:4404)
> {code}
> After this exception, the memstore is not rolled back, and since MSLAB is
> used, none of the allocated chunks are ever released. That memory is leaked
> forever.
> We need to roll back the memory if opening the region fails (for now, only the
> global memstore size is decreased after a failure).
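To make the leak concrete, here is a minimal sketch of the rollback idea, under loud assumptions: ChunkPool, Memstore and openRegion below are hypothetical stand-ins invented for illustration, not the HBase ChunkCreator, MemStoreLAB or HRegion APIs, and the failure is a simulated exception rather than a real OutOfMemoryError. It only shows that chunks taken during a failed replay need to be handed back before the open is abandoned.

```java
import java.util.ArrayList;
import java.util.List;

public class OpenRollbackSketch {
    /** Hypothetical chunk pool; counts chunks that were handed out and not returned. */
    static final class ChunkPool {
        private int outstanding = 0;
        byte[] allocate(int size) { outstanding++; return new byte[size]; }
        void release(byte[] chunk) { outstanding--; }
        int outstanding() { return outstanding; }
    }

    /** Hypothetical memstore that copies every added cell into a pooled chunk. */
    static final class Memstore {
        private final ChunkPool pool;
        private final List<byte[]> chunks = new ArrayList<>();
        Memstore(ChunkPool pool) { this.pool = pool; }
        void add(int cellSize) { chunks.add(pool.allocate(cellSize)); }
        /** Returns every chunk to the pool. */
        void close() {
            for (byte[] chunk : chunks) {
                pool.release(chunk);
            }
            chunks.clear();
        }
    }

    /** Replays `edits` edits; throws a simulated failure when reaching index `failAt`. */
    static boolean openRegion(ChunkPool pool, int edits, int failAt) {
        Memstore memstore = new Memstore(pool);
        try {
            for (int i = 0; i < edits; i++) {
                if (i == failAt) {
                    throw new IllegalStateException("simulated replay failure");
                }
                memstore.add(16);
            }
            return true;
        } catch (RuntimeException e) {
            // Roll back: without this call the chunks stay allocated forever.
            memstore.close();
            return false;
        }
    }

    public static void main(String[] args) {
        ChunkPool pool = new ChunkPool();
        boolean opened = openRegion(pool, 10, 5);
        // prints "opened=false, outstanding chunks=0"
        System.out.println("opened=" + opened + ", outstanding chunks=" + pool.outstanding());
    }
}
```

Removing the memstore.close() call in the catch block leaves five chunks outstanding in this sketch, which mirrors the leak described above.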
> Another problem is that we use replayEditsPerRegion in RegionServerAccounting
> to record how much memory is used during replay, and decrease the global
> memstore size by that amount if replay fails. This is not right: during replay
> we may also flush the memstore, so the size recorded in the replayEditsPerRegion
> map is not accurate at all!
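The accounting drift described in the last paragraph can be illustrated with a toy model. This is only a sketch under assumptions: the class below is hypothetical and is not the real RegionServerAccounting code; it borrows the name replayEditsPerRegion from the description and assumes that a flush lowers the global size without touching the per-region replay map.

```java
import java.util.HashMap;
import java.util.Map;

public class ReplayAccountingSketch {
    private long globalMemstoreSize = 0;
    // Bytes added per region while replaying edits (simplified model).
    private final Map<String, Long> replayEditsPerRegion = new HashMap<>();

    /** A replayed edit grows both the global size and the per-region replay record. */
    void addReplayedEdit(String region, long size) {
        globalMemstoreSize += size;
        replayEditsPerRegion.merge(region, size, Long::sum);
    }

    /** A flush during replay shrinks the global size; the replay record is not adjusted. */
    void flush(String region, long flushedSize) {
        globalMemstoreSize -= flushedSize;
    }

    /** On a failed replay, subtract whatever the replay record says was added. */
    void rollbackFailedReplay(String region) {
        Long recorded = replayEditsPerRegion.remove(region);
        if (recorded != null) {
            globalMemstoreSize -= recorded;
        }
    }

    long globalMemstoreSize() { return globalMemstoreSize; }

    public static void main(String[] args) {
        ReplayAccountingSketch acct = new ReplayAccountingSketch();
        acct.addReplayedEdit("regionA", 100); // global = 100, recorded = 100
        acct.flush("regionA", 100);           // global = 0,   recorded = 100
        acct.addReplayedEdit("regionA", 40);  // global = 40,  recorded = 140
        acct.rollbackFailedReplay("regionA"); // subtracts 140, not the 40 still in memory
        System.out.println(acct.globalMemstoreSize()); // prints -100; the correct value is 0
    }
}
```

In this model the rollback subtracts the 100 bytes that the flush had already removed, so the global size ends up negative.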