[ https://issues.apache.org/jira/browse/HBASE-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Stack updated HBASE-23984:
----------------------------------
    Attachment: Screen Shot 2020-03-17 at 9.46.49 PM.png

> [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown
> --------------------------------------------------------------
>
>                 Key: HBASE-23984
>                 URL: https://issues.apache.org/jira/browse/HBASE-23984
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Michael Stack
>            Priority: Major
>         Attachments: Screen Shot 2020-03-17 at 9.46.49 PM.png
>
>
> It's been failing fairly frequently of late during cluster shutdown. Seems
> basic. There is an unassign/move going on. The test just checks that the
> Master can come back up after being killed. It does not check that the move
> is done. If, on the subsequent cluster shutdown, the move can't report to
> the Master because it is shutting down, then the move fails, we abort the
> server, and we get a wonky loop where we can't close because the server is
> aborting.
> At the root, there is a misaccounting when the unassign close fails: we
> don't clean up references in the regionserver-local RIT
> (regions-in-transition) accounting. Deeper than this, the close code is
> duplicated in three places that I can see: in RegionServer, in
> CloseRegionHandler, and in UnassignRegionHandler.
> Let me fix this issue and the code dupe.
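
The misaccounting described above can be illustrated with a minimal, self-contained sketch. It is not the HBase code and not the eventual patch: the class `RitAccountingSketch`, the `MasterReporter` interface, and the `unassign` method are hypothetical stand-ins, the map is keyed by `String` rather than by encoded-name bytes, and a failed report returns `false` where the real handler aborts the server. The only point shown is that the RIT entry is removed in a `finally` block, so a failed report cannot leave a stale entry behind.

```java
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch; names here are illustrative, not HBase APIs.
final class RitAccountingSketch {

  // Stand-in for the regionserver-local regions-in-transition map.
  // Boolean.FALSE marks a region that is being closed.
  static final ConcurrentMap<String, Boolean> regionsInTransition =
      new ConcurrentHashMap<>();

  // Hypothetical stand-in for "report the close to the Master".
  interface MasterReporter {
    void reportClose(String encodedName) throws IOException;
  }

  // Registers the region as closing, tries to report the close, and
  // always unregisters the region again, whether or not the report worked.
  static boolean unassign(String encodedName, MasterReporter reporter) {
    if (regionsInTransition.putIfAbsent(encodedName, Boolean.FALSE) != null) {
      return false; // some other open/close is already in flight
    }
    try {
      reporter.reportClose(encodedName);
      return true;
    } catch (IOException e) {
      // The real handler aborts the server here; the sketch only reports failure.
      return false;
    } finally {
      // The cleanup the issue says is missing on the failure path.
      regionsInTransition.remove(encodedName, Boolean.FALSE);
    }
  }

  public static void main(String[] args) {
    // Simulate a Master that is already down.
    boolean reported = unassign("ede67f9f661acc1241faf468b081d548", name -> {
      throw new IOException("Failed to report close to master: " + name);
    });
    System.out.println("reported=" + reported
        + ", entries left in RIT map=" + regionsInTransition.size());
  }
}
```

The two-argument `remove(key, value)` only removes the entry if it is still mapped to `Boolean.FALSE`, so a concurrent transition that replaced the value would not be clobbered.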
> Details:
> From
> https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-2/5733/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestMasterAbortAndRSGotKilled-output.txt
> Here is the unassign handler failing because the Master went down earlier
> (it's probably trying to talk to the old Master location):
> {code}
> ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: Failed to close region ede67f9f661acc1241faf468b081d548 and can not recover *****
> Cause:
> java.io.IOException: Failed to report close to master: ede67f9f661acc1241faf468b081d548
>     at org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:125)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> ... then the cluster shutdown tries to close the same Region, but fails
> because we are already aborting because of the above:
> {code}
> 2020-03-12 08:11:16,600 ERROR [RS_CLOSE_REGION-regionserver/asf905:0-0] helpers.MarkerIgnoringBase(159): ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: Unrecoverable exception while closing region hbase:namespace,,1584000652744.78f4ae5beda711a9bebad0b6b8376cc9., still finishing close *****
> java.io.IOException: Aborting flush because server is aborted...
>     at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2545)
>     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2530)
>     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2504)
>     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2495)
>     at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1650)
>     at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
>     at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:110)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> ....
> And the RS keeps looping, trying to close the Region, even though we're
> aborted and there is handling in the RS close-Regions code to deal with
> abort.
> The trouble seems to be that when UnassignRegionHandler fails its region
> close, it does not unregister the Region by calling
> rs.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);



--
This message was sent by Atlassian Jira
(v8.3.4#803005)