[
https://issues.apache.org/jira/browse/HBASE-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061397#comment-17061397
]
Michael Stack commented on HBASE-23984:
---------------------------------------
!Screen Shot 2020-03-17 at 9.46.49 PM.png!
This one keeps failing... about 30% of the time. It happens pretty frequently
in local runs too.
Studying this some, it can happen in other tests' teardowns too -- locally
at least -- and there is another form of this failure where the problem
shows when AssignRegionHandler runs during cluster shutdown and the shutdown
sequence is out of the usual ordering (the TestMasterAbortAndRSGotKilled
issue is the case where UnassignRegionHandler runs at cluster shutdown). In
both cases, the close that goes with the unassign, or the open that goes w/
the assign, can fail in an unusual way when the cluster is going down and
the sequencing strays from the norm. When this happens, the exception
complaining we can't close/open because the server is aborting is thrown
WITHOUT clearing the region entry from the RegionServer
regionsInTransitionInRS Map, and so we are destined to circle the
waitOnAllRegionsToClose loop until the test times out.
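To make the hang concrete, here is a minimal sketch of the shape of that
wait loop -- simplified, hypothetical names, not the actual
HRegionServer#waitOnAllRegionsToClose code: shutdown cannot finish while any
entry remains in the regions-in-transition map, so a stale entry left behind
by a failed open/close spins the loop until the test timeout.
{code}
// Simplified, hypothetical sketch of the shutdown wait; not the actual
// HRegionServer internals.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class ShutdownWaitSketch {
  // Regions in transition on this server: TRUE = opening, FALSE = closing.
  final ConcurrentMap<String, Boolean> regionsInTransitionInRS =
      new ConcurrentHashMap<>();

  void waitOnAllRegionsToClose() throws InterruptedException {
    // If a failed open/close handler never removes its entry, this
    // condition never becomes true and shutdown spins here until the
    // test times out.
    while (!regionsInTransitionInRS.isEmpty()) {
      Thread.sleep(200); // poll until every region is out of transition
    }
  }
}
{code}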
Looking at the handler code in UnassignRegionHandler and in
AssignRegionHandler, they need a bit of restructuring so the clearing of the
Region from regionsInTransitionInRS happens even on error; CloseRegionHandler
does this already. Code in HRegionServer#closeRegion duplicates much of what
is in CloseRegionHandler, and code in UnassignRegionHandler needs to close a
Region and again dupes CloseRegionHandler.
Attached patch addresses the non-clearing of this.regionsInTransitionInRS by
moving the clearing into a finally block in the respective Handlers, making
shutdown succeed more often, and it cleans up some of the duplication.
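To illustrate the shape of the change (a sketch only, with stand-in helper
names like closeRegion and reportCloseToMaster -- not the committed patch):
do the RIT cleanup in a finally so it runs on the error path too.
{code}
// Hypothetical sketch of a restructured UnassignRegionHandler#process();
// helper names are stand-ins, not the committed patch. Fragment only, so
// the surrounding class and fields are assumed.
public void process() {
  byte[] encodedNameBytes = Bytes.toBytes(regionInfo.getEncodedName());
  try {
    closeRegion();          // may throw if the server is aborting
    reportCloseToMaster();  // may throw if the Master is already gone
  } catch (IOException e) {
    server.abort("Failed to close region and can not recover", e);
  } finally {
    // Previously the entry was only removed on the success path. Removing
    // it in a finally lets waitOnAllRegionsToClose make progress even when
    // the close or the report to the Master fails during cluster shutdown.
    server.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);
  }
}
{code}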
> [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown
> --------------------------------------------------------------
>
> Key: HBASE-23984
> URL: https://issues.apache.org/jira/browse/HBASE-23984
> Project: HBase
> Issue Type: Bug
> Reporter: Michael Stack
> Priority: Major
> Attachments: Screen Shot 2020-03-17 at 9.46.49 PM.png
>
>
> It's failing with decent frequency of late in shutdown of the cluster.
> Seems basic. There is an unassign/move going on. The test just checks the
> Master can come back up after being killed; it does not check the move is
> done. If, on subsequent cluster shutdown, the move can't report to the
> Master because it is shutting down, then the move fails, we abort the
> server, and then we get a wonky loop where we can't close because the
> server is aborting.
> At the root, there is a misaccounting when the unassign close fails: we
> don't clean up references in the regionserver's local RIT accounting.
> Deeper than this, close code is duplicated in three places that I can see:
> in RegionServer, in CloseRegionHandler, and in UnassignRegionHandler.
> Let me fix this issue and the code dupe.
> Details:
> From
> https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-2/5733/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestMasterAbortAndRSGotKilled-output.txt
> Here is the unassign handler failing because the Master went down earlier
> (it's probably trying to talk to the old Master location):
> {code}
> ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: Failed to close region ede67f9f661acc1241faf468b081d548 and can not recover *****
> Cause:
> java.io.IOException: Failed to report close to master: ede67f9f661acc1241faf468b081d548
>   at org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:125)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> ... then the cluster shutdown tries to close the same Region... but fails
> because we are aborting due to the above....
> {code}
> 2020-03-12 08:11:16,600 ERROR [RS_CLOSE_REGION-regionserver/asf905:0-0] helpers.MarkerIgnoringBase(159): ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: Unrecoverable exception while closing region hbase:namespace,,1584000652744.78f4ae5beda711a9bebad0b6b8376cc9., still finishing close *****
> java.io.IOException: Aborting flush because server is aborted...
>   at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2545)
>   at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2530)
>   at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2504)
>   at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2495)
>   at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1650)
>   at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
>   at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:110)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> ....
> And the RS keeps looping trying to close the Region even though we've
> aborted, and despite there being handling in the RS close-Regions path
> meant to deal with abort.
> Trouble seems to be that when UnassignRegionHandler fails its region
> close, it does not unregister the Region with
> rs.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);
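> (Aside on that call's semantics -- generic java.util.concurrent behavior,
> not anything HBase-specific: ConcurrentMap#remove(key, value) only removes
> the entry when the key is currently mapped to that exact value, so the
> call has to actually be reached on the failure path for the stale entry to
> go away. A tiny self-contained illustration with a made-up map:)
> {code}
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ConcurrentMap;
>
> class RemoveSemantics {
>   public static void main(String[] args) {
>     ConcurrentMap<String, Boolean> rit = new ConcurrentHashMap<>();
>     // FALSE is the "closing" marker in the RIT map.
>     rit.put("ede67f9f661acc1241faf468b081d548", Boolean.FALSE);
>
>     // remove(key, value) is conditional: it removes the entry only if
>     // the key is currently mapped to the given value.
>     boolean removed =
>         rit.remove("ede67f9f661acc1241faf468b081d548", Boolean.FALSE);
>     System.out.println(removed);       // true: entry was mapped to FALSE
>     System.out.println(rit.isEmpty()); // true: the stale entry is gone
>   }
> }
> {code}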
--
This message was sent by Atlassian Jira
(v8.3.4#803005)