[
https://issues.apache.org/jira/browse/HBASE-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177601#comment-17177601
]
Guanghao Zhang commented on HBASE-23984:
----------------------------------------
I meet this problem on branch-2.2 too. This case happened because the
DelayCloseCP. The event execute order is:
# Close regiong. But because the DelayCloseCP, it will close after 10 seconds.
# Finish ut and shutdown cluster.
# Shutdown master.
# Shutdown RS. Call waitOnAllRegionsToClose method. But abortRequested is
false now.
# Close region and failed because master is down and report master error. Then
abort RegionServer and set abortRequested to ture.
# waitOnAllRegionsToClose hanged because the online regions cannot be empty.
waitOnAllRegionsToClose(final boolean abort) already consider the abort case
but the problem is abortRequested is false when call this method. I thought the
fix should be that keep to check the abortRequested in waitOnAllRegionsToClose
method internal.
> [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown
> --------------------------------------------------------------
>
> Key: HBASE-23984
> URL: https://issues.apache.org/jira/browse/HBASE-23984
> Project: HBase
> Issue Type: Test
> Affects Versions: 2.3.0, 2.4.0
> Reporter: Michael Stack
> Assignee: Michael Stack
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.0
>
> Attachments:
> 0001-HBASE-23984-Flakey-Tests-TestMasterAbortAndRSGotKill.patch, Screen Shot
> 2020-03-17 at 9.46.49 PM.png
>
>
> Its failing with decent frequency of late in shutdown of cluster. Seems
> basic. There is an unassign/move going on. Test just checks Master can come
> back up after being killed. Does not check move is done. If on subsequent
> cluster shutdown, if the move can't report the Master because its shutting
> down, then the move fails, we abort the server, and then we get a wonky loop
> where we can't close because server is aborting.
> At the root, there is a misaccounting when the unassign close fails where we
> don't cleanup references in the regionserver local RIT accounting. Deeper
> than this, close code is duplicated in three places that I can see; in
> RegionServer, in CloseRegionHandler, and in UnassignRegionHandler.
> Let me fix this issue and the code dupe.
> Details:
> From
> https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-2/5733/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestMasterAbortAndRSGotKilled-output.txt
> Here is the unassign handler failing because master went down earlier (Its
> probably trying to talk to the old Master location)
> {code}
> ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108:
> Failed to close region ede67f9f661acc1241faf468b081d548 and can not recover
> *****
> Cause:
> java.io.IOException: Failed to report close to master:
> ede67f9f661acc1241faf468b081d548
> at
> org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:125)
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> ... then the cluster shutdown tries to close the same Region... but fails
> because we are aborting because of above....
> {code}
> 2020-03-12 08:11:16,600 ERROR [RS_CLOSE_REGION-regionserver/asf905:0-0]
> helpers.MarkerIgnoringBase(159): ***** ABORTING region server
> asf905.gq1.ygridcore.net,32989,1584000644108: Unrecoverable exception while
> closing region
> hbase:namespace,,1584000652744.78f4ae5beda711a9bebad0b6b8376cc9., still
> finishing close *****
> java.io.IOException: Aborting flush because server is aborted...
> at
> org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2545)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2530)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2504)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2495)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1650)
> at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
> at
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:110)
> at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> ....
> And the RS keeps looping trying to close the Region even though we're aborted
> and there is handling in RS close Regions to deal with abort.
> Trouble seems to be because when UnassignRegionHandler fails its region
> close, it does not unregister the Region with
> rs.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);
--
This message was sent by Atlassian Jira
(v8.3.4#803005)