[
https://issues.apache.org/jira/browse/HBASE-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Stack resolved HBASE-23984.
-----------------------------------
Fix Version/s: 2.3.0, 3.0.0
Hadoop Flags: Reviewed
Assignee: Michael Stack
Resolution: Fixed
Pushed to branch-2.3, branch-2, and master. I didn't mark this 2.4.0, as per
the comment by Duo on the dev list: until there is a 2.3.0 release, an issue
that lands in both is marked 2.3.0, not 2.4.0.
Thanks for the reviews. Linked follow-ons to this issue.
> [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown
> --------------------------------------------------------------
>
> Key: HBASE-23984
> URL: https://issues.apache.org/jira/browse/HBASE-23984
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.3.0, 2.4.0
> Reporter: Michael Stack
> Assignee: Michael Stack
> Priority: Major
> Fix For: 3.0.0, 2.3.0
>
> Attachments:
> 0001-HBASE-23984-Flakey-Tests-TestMasterAbortAndRSGotKill.patch, Screen Shot
> 2020-03-17 at 9.46.49 PM.png
>
>
> It's failing with decent frequency of late in shutdown of the cluster. Seems
> basic. There is an unassign/move going on. The test just checks that the
> Master can come back up after being killed; it does not check that the move
> is done. If, on subsequent cluster shutdown, the move can't report to the
> Master because it's shutting down, then the move fails, we abort the server,
> and then we get a wonky loop where we can't close because the server is
> aborting.
> At the root, there is a misaccounting when the unassign close fails: we
> don't clean up references in the regionserver-local RIT accounting. Deeper
> than this, the close code is duplicated in three places that I can see: in
> RegionServer, in CloseRegionHandler, and in UnassignRegionHandler.
> Let me fix this issue and the code duplication.
> Details:
> From
> https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-2/5733/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestMasterAbortAndRSGotKilled-output.txt
> Here is the unassign handler failing because the Master went down earlier
> (it's probably trying to talk to the old Master location):
> {code}
> ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: Failed to close region ede67f9f661acc1241faf468b081d548 and can not recover *****
> Cause:
> java.io.IOException: Failed to report close to master: ede67f9f661acc1241faf468b081d548
>   at org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:125)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> ... then the cluster shutdown tries to close the same Region, but fails
> because we are already aborting because of the failure above:
> {code}
> 2020-03-12 08:11:16,600 ERROR [RS_CLOSE_REGION-regionserver/asf905:0-0] helpers.MarkerIgnoringBase(159): ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: Unrecoverable exception while closing region hbase:namespace,,1584000652744.78f4ae5beda711a9bebad0b6b8376cc9., still finishing close *****
> java.io.IOException: Aborting flush because server is aborted...
>   at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2545)
>   at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2530)
>   at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2504)
>   at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2495)
>   at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1650)
>   at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
>   at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:110)
>   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> ....
> And the RS keeps looping, trying to close the Region, even though we're
> aborted and there is handling in the RS close-Regions code to deal with abort.
> The trouble seems to be that when UnassignRegionHandler fails its region
> close, it does not unregister the Region with
> rs.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);
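
The accounting gap described in the quoted issue can be illustrated with a small standalone sketch. This is not the committed patch and it does not use HBase classes: RitAccountingSketch, CloseStep, and closeRegion are hypothetical names, the map is keyed by String rather than byte[] for simplicity, and the convention that Boolean.FALSE marks a close in progress is an assumption taken from the remove(encodedNameBytes, Boolean.FALSE) call quoted above. The point is only that the in-transition entry is removed on the failure path as well as the success path.

```java
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for the regionserver-local RIT accounting; not HBase code. */
class RitAccountingSketch {
  // Encoded region name -> action in progress.
  // Assumption for this sketch: Boolean.FALSE means "close in progress".
  static final ConcurrentMap<String, Boolean> regionsInTransition =
      new ConcurrentHashMap<>();

  /** Hypothetical close step that may fail, e.g. when the Master cannot be reached. */
  interface CloseStep {
    void run() throws IOException;
  }

  /** Returns true if the close step succeeded, false otherwise. */
  static boolean closeRegion(String encodedName, CloseStep step) {
    // Register the region as closing; give up if it is already in transition.
    if (regionsInTransition.putIfAbsent(encodedName, Boolean.FALSE) != null) {
      return false;
    }
    try {
      step.run();
      return true;
    } catch (IOException e) {
      // A real handler would abort the server here; the sketch only reports failure.
      return false;
    } finally {
      // Runs on success and on failure. The two-argument remove only deletes
      // the entry if it is still mapped to FALSE.
      regionsInTransition.remove(encodedName, Boolean.FALSE);
    }
  }

  public static void main(String[] args) {
    String region = "ede67f9f661acc1241faf468b081d548";
    boolean closed = closeRegion(region, () -> {
      throw new IOException("Failed to report close to master: " + region);
    });
    System.out.println("closed=" + closed
        + ", stillInTransition=" + regionsInTransition.containsKey(region));
  }
}
```

Because the entry is cleared in the finally block, a later close attempt for the same region (for example, during cluster shutdown) is not blocked by a stale in-transition marker left behind by the failed unassign.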