[ https://issues.apache.org/jira/browse/HBASE-25353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244815#comment-17244815 ]

Hudson commented on HBASE-25353:
--------------------------------

Results for branch branch-2
[build #123 on builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/123/]:
 (/) *{color:green}+1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/123/General_20Nightly_20Build_20Report/]


(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/123/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/123/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/123/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> [Flakey Tests] branch-2 TestShutdownBackupMaster
> ------------------------------------------------
>
>                 Key: HBASE-25353
>                 URL: https://issues.apache.org/jira/browse/HBASE-25353
>             Project: HBase
>          Issue Type: Sub-task
>          Components: flakies
>    Affects Versions: 2.4.0
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 2.3.4, 2.5.0, 2.4.1
>
>
> Making this a sub-issue of the parent issue, which fails in a way similar to 
> how we are failing now.
> Currently, I see that the TestShutdownBackupMaster test usually passes, but 
> it is warped in how it completes. It will do all retries just before the test 
> times out at 13 minutes max, e.g. you'll see this:
> 2020-12-02 22:07:34,200 DEBUG [master/stack:0:becomeActiveMaster] client.ConnectionImplementation(1009): locateRegionInMeta parentTable='hbase:meta', attempt=44 of 46 failed; retrying after sleep of 46
> ... so we'll do all the retries and then complete, so the test looks like it 
> 'succeeded' but it actually ran for Total time: 12:41 min, and the log is 
> full of thread dumps because the cluster won't go down (the time is spent in 
> the test shutdown).
> Often, though, we won't complete the retries in time and the test fails. It is 
> in the flakey list.
> Rather, we are supposed to fail out fast when we are shutting down. Below is 
> the type of retry we see.
>  
> {code:java}
> 2020-12-02 10:53:35,540 INFO [Listener at localhost/61609] util.JVMClusterUtil(348): Shutdown of 2 master(s) and 2 regionserver(s) complete
> 2020-12-02 10:53:35,548 DEBUG [master/stack:0:becomeActiveMaster] client.ConnectionImplementation(1009): locateRegionInMeta parentTable='hbase:meta', attempt=2 of 46 failed; retrying after sleep of 46
> org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x1afa7f5b closed
>  at org.apache.hadoop.hbase.client.ConnectionImplementation.checkClosed(ConnectionImplementation.java:630)
>  at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:815)
>  at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:138)
>  at org.apache.hadoop.hbase.client.ConnectionImplementation.relocateRegion(ConnectionImplementation.java:803)
>  at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.relocateRegion(ConnectionUtils.java:138)
>  at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:933)
>  at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:823)
>  at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:138)
>  at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:64)
>  at org.apache.hadoop.hbase.client.RegionLocator.getRegionLocation(RegionLocator.java:70)
>  at org.apache.hadoop.hbase.client.RegionLocator.getRegionLocation(RegionLocator.java:59)
>  at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:223)
>  at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
>  at org.apache.hadoop.hbase.client.HTable.get(HTable.java:383)
>  at org.apache.hadoop.hbase.client.HTable.get(HTable.java:357)
>  at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141)
>  at org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:278)
>  at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103)
>  at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63)
>  at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:249)
>  at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1224)
>  at org.apache.hadoop.hbase.master.TestShutdownBackupMaster$MockHMaster.initClusterSchemaService(TestShutdownBackupMaster.java:68)
>  at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1021)
>  at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2082)
>  at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:506)
> {code}
> See how a master is trying to become active, and it won't relent even though 
> this cluster is shutting down? See how we retry even though the check for 
> close of the connection comes back with a DoNotRetryIOException? The 
> exception is being swallowed, so we keep going.
> Fix looks simple enough.
>  
>  
>  
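
The fail-fast behaviour the description asks for can be illustrated with a minimal, self-contained sketch. This is not the HBase code and not the actual patch for this issue: `FailFastRetry`, `callWithRetries`, and the nested `DoNotRetryException` are hypothetical names, with `DoNotRetryException` standing in for HBase's `DoNotRetryIOException`. It only shows the general idea of rethrowing a non-retriable exception immediately instead of swallowing it and sleeping before the next attempt.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical illustration only; none of these names come from HBase.
class FailFastRetry {

    /** Hypothetical stand-in for HBase's DoNotRetryIOException. */
    static class DoNotRetryException extends IOException {
        DoNotRetryException(String message) {
            super(message);
        }
    }

    /**
     * Runs the call up to maxAttempts times, sleeping sleepMs between attempts.
     * A DoNotRetryException ends the loop at once instead of being swallowed.
     */
    static <T> T callWithRetries(Callable<T> call, int maxAttempts, long sleepMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (DoNotRetryException e) {
                // Non-retriable (e.g. the connection is closed): fail out fast.
                throw e;
            } catch (IOException e) {
                // Retriable failure: remember it and try again after a pause.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMs);
                }
            }
        }
        // All attempts failed with retriable errors; surface the last one.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        try {
            callWithRetries(() -> {
                attempts[0]++;
                throw new DoNotRetryException("connection closed");
            }, 46, 46L);
        } catch (DoNotRetryException e) {
            System.out.println("gave up after " + attempts[0] + " attempt(s): " + e.getMessage());
        }
    }
}
```

With the non-retriable exception rethrown, the example gives up after the first attempt rather than running through all 46, which is the difference between shutting down promptly and the 12:41 min run described above.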



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
