[jira] [Commented] (HBASE-23984) [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown

Michael Stack (Jira) Thu, 19 Mar 2020 10:08:25 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062767#comment-17062767
 ]


Michael Stack commented on HBASE-23984:
---------------------------------------

Thanks [~zhangduo].

bq. ....so whether or not to remove an entry from the regionsInTransitionInRS 
does not matter...

But it does. If we don't remove ourselves from regionsInTransitionInRS, even 
though we are aborting, the RS won't go down. It gets stuck in the 
RS#waitOnAllRegionsToClose loop.

bq.  If we hang in shutdown then it is the problem for the shutdown method, not 
the problem for UnassignRegionHandler.

Nah. The shutdown method looks at online regions and regionsInTransitionInRS. 
Trying to have shutdown do compensation for a Handler that does not do proper 
state handling will complicate logic needlessly in an already involved process.

> [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown
> --------------------------------------------------------------
>
>                 Key: HBASE-23984
>                 URL: https://issues.apache.org/jira/browse/HBASE-23984
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Michael Stack
>            Priority: Major
>         Attachments: Screen Shot 2020-03-17 at 9.46.49 PM.png
>
>
> Its failing with decent frequency of late in shutdown of cluster. Seems 
> basic. There is an unassign/move going on. Test just checks Master can come 
> back up after being killed. Does not check move is done. If on subsequent 
> cluster shutdown, if the move can't report the Master because its shutting 
> down, then the move fails, we abort the server, and then we get a wonky loop 
> where we can't close because server is aborting.
> At the root, there is a misaccounting when the unassign close fails where we 
> don't cleanup references in the regionserver local RIT accounting. Deeper 
> than this, close code is duplicated in three places that I can see; in 
> RegionServer, in CloseRegionHandler, and in UnassignRegionHandler.
> Let me fix this issue and the code dupe.
> Details:
> From 
> https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-2/5733/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestMasterAbortAndRSGotKilled-output.txt
> Here is the unassign handler failing because master went down earlier (Its 
> probably trying to talk to the old Master location)
> {code}
> ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: 
> Failed to close region ede67f9f661acc1241faf468b081d548 and can not recover 
> *****
> Cause:
> java.io.IOException: Failed to report close to master: 
> ede67f9f661acc1241faf468b081d548
>       at 
> org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:125)
>       at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> ... then the cluster shutdown tries to close the same Region... but fails 
> because we are aborting because of above.... 
> {code}
> 2020-03-12 08:11:16,600 ERROR [RS_CLOSE_REGION-regionserver/asf905:0-0] 
> helpers.MarkerIgnoringBase(159): ***** ABORTING region server 
> asf905.gq1.ygridcore.net,32989,1584000644108: Unrecoverable exception while 
> closing region 
> hbase:namespace,,1584000652744.78f4ae5beda711a9bebad0b6b8376cc9., still 
> finishing close *****
> java.io.IOException: Aborting flush because server is aborted...
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2545)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2530)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2504)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2495)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1650)
>       at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
>       at 
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:110)
>       at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> ....
> And the RS keeps looping trying to close the Region even though we're aborted 
> and there is handling in RS close Regions to deal with abort.
> Trouble seems to be because when UnassignRegionHandler fails its region 
> close, it does not unregister the Region with 
> rs.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HBASE-23984) [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown

Reply via email to