[jira] [Commented] (HBASE-23984) [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown

Hudson (Jira) Sat, 21 Mar 2020 09:50:46 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063982#comment-17063982
 ]


Hudson commented on HBASE-23984:
--------------------------------

Results for branch branch-2.3
        [build #3 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/3/]: 
(x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/3//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/3//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/3//JDK8_Nightly_Build_Report_(Hadoop3)/]


(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/3//JDK11_Nightly_Build_Report/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(x) {color:red}-1 client integration test{color}
--Failed when running client tests on top of Hadoop 2. [see log for 
details|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/3//artifact/output-integration/hadoop-2.log].
 (note that this means we didn't run on Hadoop 3)


> [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown
> --------------------------------------------------------------
>
>                 Key: HBASE-23984
>                 URL: https://issues.apache.org/jira/browse/HBASE-23984
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0, 2.3.0
>
>         Attachments: 
> 0001-HBASE-23984-Flakey-Tests-TestMasterAbortAndRSGotKill.patch, Screen Shot 
> 2020-03-17 at 9.46.49 PM.png
>
>
> Its failing with decent frequency of late in shutdown of cluster. Seems 
> basic. There is an unassign/move going on. Test just checks Master can come 
> back up after being killed. Does not check move is done. If on subsequent 
> cluster shutdown, if the move can't report the Master because its shutting 
> down, then the move fails, we abort the server, and then we get a wonky loop 
> where we can't close because server is aborting.
> At the root, there is a misaccounting when the unassign close fails where we 
> don't cleanup references in the regionserver local RIT accounting. Deeper 
> than this, close code is duplicated in three places that I can see; in 
> RegionServer, in CloseRegionHandler, and in UnassignRegionHandler.
> Let me fix this issue and the code dupe.
> Details:
> From 
> https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-2/5733/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestMasterAbortAndRSGotKilled-output.txt
> Here is the unassign handler failing because master went down earlier (Its 
> probably trying to talk to the old Master location)
> {code}
> ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: 
> Failed to close region ede67f9f661acc1241faf468b081d548 and can not recover 
> *****
> Cause:
> java.io.IOException: Failed to report close to master: 
> ede67f9f661acc1241faf468b081d548
>       at 
> org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:125)
>       at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> ... then the cluster shutdown tries to close the same Region... but fails 
> because we are aborting because of above.... 
> {code}
> 2020-03-12 08:11:16,600 ERROR [RS_CLOSE_REGION-regionserver/asf905:0-0] 
> helpers.MarkerIgnoringBase(159): ***** ABORTING region server 
> asf905.gq1.ygridcore.net,32989,1584000644108: Unrecoverable exception while 
> closing region 
> hbase:namespace,,1584000652744.78f4ae5beda711a9bebad0b6b8376cc9., still 
> finishing close *****
> java.io.IOException: Aborting flush because server is aborted...
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2545)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2530)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2504)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2495)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1650)
>       at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
>       at 
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:110)
>       at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> ....
> And the RS keeps looping trying to close the Region even though we're aborted 
> and there is handling in RS close Regions to deal with abort.
> Trouble seems to be because when UnassignRegionHandler fails its region 
> close, it does not unregister the Region with 
> rs.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HBASE-23984) [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown

Reply via email to