[ 
https://issues.apache.org/jira/browse/HBASE-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-23984.
-----------------------------------
    Fix Version/s: 2.3.0
                   3.0.0
     Hadoop Flags: Reviewed
         Assignee: Michael Stack
       Resolution: Fixed

Pushed to branch-2.3, branch-2, and master. I didn't mark this 2.4.0, as per 
the comment by Duo on the dev list: until there is a 2.3.0 release, an issue 
that is in both gets 2.3.0, not 2.4.0.

Thanks for the reviews. Linked follow-ons to this issue.

> [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown
> --------------------------------------------------------------
>
>                 Key: HBASE-23984
>                 URL: https://issues.apache.org/jira/browse/HBASE-23984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0, 2.3.0
>
>         Attachments: 
> 0001-HBASE-23984-Flakey-Tests-TestMasterAbortAndRSGotKill.patch, Screen Shot 
> 2020-03-17 at 9.46.49 PM.png
>
>
> It's failing with decent frequency of late in shutdown of cluster. Seems 
> basic. There is an unassign/move going on. The test just checks that the 
> Master can come back up after being killed. It does not check that the move 
> is done. If, on subsequent cluster shutdown, the move can't report to the 
> Master because it's shutting down, then the move fails, we abort the server, 
> and then we get a wonky loop where we can't close because the server is 
> aborting.
> At the root, there is a misaccounting when the unassign close fails: we 
> don't clean up references in the regionserver-local RIT accounting. Deeper 
> than this, close code is duplicated in three places that I can see: in 
> RegionServer, in CloseRegionHandler, and in UnassignRegionHandler.
> Let me fix this issue and the code dupe.
> Details:
> From 
> https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-2/5733/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestMasterAbortAndRSGotKilled-output.txt
> Here is the unassign handler failing because the Master went down earlier 
> (it's probably trying to talk to the old Master location):
> {code}
> ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: 
> Failed to close region ede67f9f661acc1241faf468b081d548 and can not recover 
> *****
> Cause:
> java.io.IOException: Failed to report close to master: 
> ede67f9f661acc1241faf468b081d548
>       at 
> org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:125)
>       at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> ... then the cluster shutdown tries to close the same Region... but fails 
> because we are aborting because of above.... 
> {code}
> 2020-03-12 08:11:16,600 ERROR [RS_CLOSE_REGION-regionserver/asf905:0-0] 
> helpers.MarkerIgnoringBase(159): ***** ABORTING region server 
> asf905.gq1.ygridcore.net,32989,1584000644108: Unrecoverable exception while 
> closing region 
> hbase:namespace,,1584000652744.78f4ae5beda711a9bebad0b6b8376cc9., still 
> finishing close *****
> java.io.IOException: Aborting flush because server is aborted...
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2545)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2530)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2504)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2495)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1650)
>       at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
>       at 
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:110)
>       at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> ....
> And the RS keeps looping, trying to close the Region, even though we're 
> aborted and there is handling in the RS close-Regions code to deal with abort.
> The trouble seems to be that when UnassignRegionHandler fails its region 
> close, it does not unregister the Region with 
> rs.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);
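
To illustrate the accounting problem described above, here is a minimal, self-contained Java sketch. It is not the committed patch. The class RitAccountingSketch, the CloseStep interface, and the String-keyed map are hypothetical stand-ins for the region server's regions-in-transition map (the real code keys on encoded-name bytes). The idea shown is a single shared close path that always clears its own "closing" marker, even when the close or the report to the Master throws.

```java
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the region server's local RIT accounting.
class RitAccountingSketch {
  // Encoded region name -> TRUE while opening, FALSE while closing (assumed convention).
  static final ConcurrentMap<String, Boolean> regionsInTransition =
      new ConcurrentHashMap<>();

  // Hypothetical close step; it may throw, for example when the report to the Master fails.
  interface CloseStep {
    void run() throws IOException;
  }

  // One shared close path instead of three copies. Returns false if another
  // open or close is already in flight for the region.
  static boolean closeRegion(String encodedName, CloseStep step) throws IOException {
    Boolean previous = regionsInTransition.putIfAbsent(encodedName, Boolean.FALSE);
    if (previous != null) {
      return false; // someone else owns the transition; leave their marker alone
    }
    try {
      step.run();
      return true;
    } finally {
      // Remove only our own "closing" marker, on success and on failure alike,
      // so a failed close does not leave a stale entry behind.
      regionsInTransition.remove(encodedName, Boolean.FALSE);
    }
  }

  public static void main(String[] args) {
    String region = "ede67f9f661acc1241faf468b081d548";
    try {
      closeRegion(region, () -> {
        throw new IOException("Failed to report close to master: " + region);
      });
    } catch (IOException e) {
      System.out.println("close failed: " + e.getMessage());
    }
    System.out.println("still in transition: " + regionsInTransition.containsKey(region));
  }
}
```

The two-argument remove(key, value) only deletes the entry if it still maps to Boolean.FALSE, so a concurrent open that has since registered TRUE for the same region would not be clobbered.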



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
