[ 
https://issues.apache.org/jira/browse/HBASE-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395229#comment-16395229
 ] 

Eshcar Hillel edited comment on HBASE-20090 at 3/12/18 1:27 PM:
----------------------------------------------------------------

I am trying to see if I understand correctly the discussed scenario:
 * Region A (TestTable) and region B(Atlas table) are the only two regions in 
the region server.
 * The region server exceeds the global pressure threshold (380MB)
 * Region A has 400MB while region B has 0MB
 * Region A has more than 5GB of data on disk (6.9GB) hence a split is 
triggered.
 * Finally, (this is what is not 100% clear) in the next loop calling to 
flushOneForGlobalPressure region A is not selected (correct?), region B is 
selected but since it has size 0 the precondition test throws an exception.

Is this the right way to describe the problem? Can you explain again why region 
A is not selected, is it because its size is set to 0 or since it is marked as 
non-flushable?

To summarize, if I understand correctly there is a period of time where the 
global memstore size is high (>380MB) and at the same time the region holding 
all this data is being split and hence the data is in the process of being 
flushed, as a result the memstore flusher that is triggered due to global 
pressure finds itself in an inconsistent state:

{color:#008000}"Above memory mark but there are no flushable regions!"{color}

{color:#333333}Do you think that adding a check in line 187 would solve the 
problem?{color}
{code:java}
if (bestAnyRegion == null 
    ||  bestAnyRegion.getMemStoreDataSize() == 0) // NEW
{
  LOG.error("Above memory mark but there are no flushable regions!");
  return false;
}{code}
 


was (Author: eshcar):
I am trying to see if I understand correctly the discussed scenario:
 * Region A (TestTable) and region B(Atlas table) are the only two regions in 
the region server.
 * The region server exceeds the global pressure threshold (380MB)
 * Region A has 400MB while region B has 0MB
 * Region A has more than 5GB of data on disk (6.9GB) hence a split is 
triggered.
 * Finally, (this is what is not 100% clear:) in the next loop calling to 
flushOneForGlobalPressure region A is not selected (correct?), region B is 
selected but since it has size 0 the precondition test throws an exception.

Is this the right way to describe the problem? Can you explain again why region 
A is not selected, is it because its size is set to 0 or since it is marked as 
non-flushable?

To summarize, if I understand correctly there is a period of time where the 
global memstore size is high (>380MB) and at the same time the region holding 
all this data is being split and hence the data is in the process of being 
flushed, as a result the memstore flusher that is triggered due to global 
pressure finds itself in an inconsistent state:

{color:#008000}"Above memory mark but there are no flushable regions!"{color}

{color:#333333}Do you think that adding a check in line 187 would solve the 
problem?
{color}
{code:java}
if (bestAnyRegion == null 
    ||  bestAnyRegion.getMemStoreDataSize() == 0) // NEW
{
  LOG.error("Above memory mark but there are no flushable regions!");
  return false;
}{code}
 

> Properly handle Preconditions check failure in 
> MemStoreFlusher$FlushHandler.run
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-20090
>                 URL: https://issues.apache.org/jira/browse/HBASE-20090
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Major
>         Attachments: 20090-server-61260-01-000007.log, 20090.v6.txt
>
>
> Copied the following from a comment since this was better description of the 
> race condition.
> The original description was merged to the beginning of my first comment 
> below.
> With more debug logging, we can see the scenario where the exception was 
> triggered.
> {code}
> 2018-03-02 17:28:30,097 DEBUG [MemStoreFlusher.0] regionserver.CompactSplit: 
> Splitting TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085., 
> compaction_queue=(0:0),     split_queue=1
> 2018-03-02 17:28:30,098 DEBUG 
> [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16020] 
> regionserver.IncreasingToUpperBoundRegionSplitPolicy: ShouldSplit because 
> info      size=6.9G, sizeToCheck=256.0M, regionsWithCommonTable=1
> 2018-03-02 17:28:30,296 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=24,queue=0,port=16020] 
> regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
> 2018-03-02 17:28:30,297 DEBUG [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Flush thread woke up because memory above low 
> water=381.5 M
> 2018-03-02 17:28:30,297 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=25,queue=1,port=16020] 
> regionserver.MemStoreFlusher: wake up flusher due to ABOVE_ONHEAP_LOWER_MARK
> 2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: region 
> TestTable,,1520011528142.0453f29030757eedb6e6a1c57e88c085. with size 400432696
> 2018-03-02 17:28:30,298 DEBUG [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: region 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. with size 0
> 2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Flush of region 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. due to global    
>  heap pressure. Flush type=ABOVE_ONHEAP_LOWER_MARKTotal Memstore Heap 
> size=381.9 MTotal Memstore Off-Heap size=0, Region memstore size=0
> 2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: wake up by WAKEUPFLUSH_INSTANCE
> 2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Nothing to flush for 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae.
> 2018-03-02 17:28:30,298 INFO  [MemStoreFlusher.1] 
> regionserver.MemStoreFlusher: Excluding unflushable region 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. -    trying to 
> find a different region to flush.
> {code}
> Region 0453f29030757eedb6e6a1c57e88c085 was being split.
> In HRegion#flushcache, the log from else branch can be seen in 
> 20090-server-61260-01-000007.log :
> {code}
>       synchronized (writestate) {
>         if (!writestate.flushing && writestate.writesEnabled) {
>           this.writestate.flushing = true;
>         } else {
>           if (LOG.isDebugEnabled()) {
>             LOG.debug("NOT flushing memstore for region " + this
>                 + ", flushing=" + writestate.flushing + ", writesEnabled="
>                 + writestate.writesEnabled);
>           }
> {code}
> Meaning, region 0453f29030757eedb6e6a1c57e88c085 couldn't flush, leaving 
> memory pressure at high level.
> When MemStoreFlusher ran to the following call, the region was no longer a 
> flush candidate:
> {code}
>       HRegion bestFlushableRegion =
>           getBiggestMemStoreRegion(regionsBySize, excludedRegions, true);
> {code}
> So the other region, 
> atlas_janus,,1519927429371.fbcb5e495344542daf8b499e4bac03ae. , was examined 
> next. Since the region was not receiving write, the (current) Precondition 
> check failed.
> The proposed fix is to convert the Precondition to normal return.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to