Michael Stack created HBASE-25353:
-------------------------------------
Summary: [Flakey Tests] branch-2 TestShutdownBackupMaster
Key: HBASE-25353
URL: https://issues.apache.org/jira/browse/HBASE-25353
Project: HBase
Issue Type: Sub-task
Components: flakies
Affects Versions: 2.4.0
Reporter: Michael Stack
Assignee: Michael Stack
Fix For: 2.4.1
Filing this as a sub-task of the parent issue, which fails in a way similar to
what we are seeing now.
Currently, I see that the TestShutdownBackupMaster test usually passes, but the
way it completes is warped. It runs through all its retries just before the test
times out at 13 minutes max, e.g. you'll see this:
2020-12-02 22:07:34,200 DEBUG [master/stack:0:becomeActiveMaster] client.ConnectionImplementation(1009): locateRegionInMeta parentTable='hbase:meta', attempt=44 of 46 failed; retrying after sleep of 46
... so we do all the retries and then complete. The test looks like it
'succeeded', but it actually ran for Total time: 12:41 min, and the log is
full of thread dumps because the cluster won't go down (the time is spent in
the test shutdown).
Often, though, we don't complete the retries in time and the test fails. It is
on the flakey list.
Instead, we are supposed to fail fast when we are shutting down. Below is
the kind of retry we see.
{code:java}
2020-12-02 10:53:35,540 INFO [Listener at localhost/61609] util.JVMClusterUtil(348): Shutdown of 2 master(s) and 2 regionserver(s) complete
2020-12-02 10:53:35,548 DEBUG [master/stack:0:becomeActiveMaster] client.ConnectionImplementation(1009): locateRegionInMeta parentTable='hbase:meta', attempt=2 of 46 failed; retrying after sleep of 46
org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x1afa7f5b closed
    at org.apache.hadoop.hbase.client.ConnectionImplementation.checkClosed(ConnectionImplementation.java:630)
    at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:815)
    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:138)
    at org.apache.hadoop.hbase.client.ConnectionImplementation.relocateRegion(ConnectionImplementation.java:803)
    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.relocateRegion(ConnectionUtils.java:138)
    at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:933)
    at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:823)
    at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:138)
    at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:64)
    at org.apache.hadoop.hbase.client.RegionLocator.getRegionLocation(RegionLocator.java:70)
    at org.apache.hadoop.hbase.client.RegionLocator.getRegionLocation(RegionLocator.java:59)
    at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:223)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:383)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:357)
    at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141)
    at org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:278)
    at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103)
    at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63)
    at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:249)
    at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1224)
    at org.apache.hadoop.hbase.master.TestShutdownBackupMaster$MockHMaster.initClusterSchemaService(TestShutdownBackupMaster.java:68)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1021)
    at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2082)
    at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:506)
{code}
See how a master is trying to become active and won't give up trying to
become active master even though this cluster is shutting down? See how we
retry even though the check for a closed connection is coming back with a
DoNotRetryIOException? The exception is being swallowed. We keep going.
Fix looks simple enough.
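To make the failure mode concrete, here is a minimal, self-contained sketch of the difference between a retry loop that swallows every exception and one that fails fast. It is not the HBase code and not necessarily the fix that went in: FailFastRetry, Attempt, retrySwallowing and retryFailFast are hypothetical names, the DoNotRetryIOException here is a local stand-in for the HBase class, and the sleeps between attempts are left out.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class FailFastRetry {
    /** Hypothetical local stand-in for HBase's DoNotRetryIOException. */
    static class DoNotRetryIOException extends IOException {
        DoNotRetryIOException(String message) {
            super(message);
        }
    }

    /** One attempt of some operation that may fail with an IOException. */
    interface Attempt {
        void run() throws IOException;
    }

    /** Swallows every IOException and keeps retrying; returns the attempts made. */
    static int retrySwallowing(Attempt attempt, int maxAttempts) {
        int made = 0;
        while (made < maxAttempts) {
            made++;
            try {
                attempt.run();
                return made;
            } catch (IOException e) {
                // Swallowed, DoNotRetryIOException included: the loop keeps going.
            }
        }
        return made;
    }

    /** Rethrows DoNotRetryIOException at once; retries any other IOException. */
    static void retryFailFast(Attempt attempt, int maxAttempts) throws IOException {
        IOException last = new IOException("no attempts made");
        for (int i = 0; i < maxAttempts; i++) {
            try {
                attempt.run();
                return;
            } catch (DoNotRetryIOException e) {
                throw e; // retrying cannot help, so give up immediately
            } catch (IOException e) {
                last = e; // possibly transient, so try again
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulates a connection that has already been closed by shutdown.
        Attempt closed = () -> {
            calls.incrementAndGet();
            throw new DoNotRetryIOException("connection closed");
        };

        System.out.println("swallowing: " + retrySwallowing(closed, 46) + " attempts");

        calls.set(0);
        try {
            retryFailFast(closed, 46);
        } catch (IOException e) {
            System.out.println("fail-fast: " + calls.get() + " attempt(s), then: " + e.getMessage());
        }
    }
}
```

With a connection that is already closed, the swallowing loop burns all 46 attempts, while the fail-fast loop gives up after the first one.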