[GitHub] [hbase] bharathv commented on a change in pull request #1690: HBASE-24327 : Flaky connection in TestMasterShutdown#testMasterShutdo…

GitBox Thu, 14 May 2020 14:45:26 -0700


bharathv commented on a change in pull request #1690:
URL: https://github.com/apache/hbase/pull/1690#discussion_r425448865




##########
File path: 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterShutdown.java
##########
@@ -151,19 +152,46 @@ public void testMasterShutdownBeforeStartingAnyRegionServer() throws Exception {
       hbaseCluster = new LocalHBaseCluster(htu.getConfiguration(), options.getNumMasters(),
         options.getNumRegionServers(), options.getMasterClass(), options.getRsClass());
       final MasterThread masterThread = hbaseCluster.getMasters().get(0);
+
       masterThread.start();
-      // Switching to master registry exacerbated a race in the master bootstrap that can result
-      // in a lost shutdown command (HBASE-8422, HBASE-23836). The race is essentially because
-      // the server manager in HMaster is not initialized by the time shutdown() RPC (below) is
-      // made to the master. The suspected reason as to why it was uncommon before HBASE-18095
-      // is because the connection creation with ZK registry is so slow that by then the server
-      // manager is usually init'ed in time for the RPC to be made. For now, adding an explicit
-      // wait() in the test, waiting for the server manager to become available.
-      final long timeout = TimeUnit.MINUTES.toMillis(10);
-      assertNotEquals("Timeout waiting for server manager to become available.",
-        -1, Waiter.waitFor(htu.getConfiguration(), timeout,
-          () -> masterThread.getMaster().getServerManager() != null));
-      htu.getConnection().getAdmin().shutdown();
+      final CompletableFuture<Void> shutdownFuture = CompletableFuture.runAsync(() -> {
+        // Switching to master registry exacerbated a race in the master bootstrap that can result
+        // in a lost shutdown command (HBASE-8422, HBASE-23836). The race is essentially because
+        // the server manager in HMaster is not initialized by the time shutdown() RPC (below) is
+        // made to the master. The suspected reason as to why it was uncommon before HBASE-18095
+        // is because the connection creation with ZK registry is so slow that by then the server
+        // manager is usually init'ed in time for the RPC to be made. For now, adding an explicit
+        // wait() in the test, waiting for the server manager to become available.
+        final long timeout = TimeUnit.MINUTES.toMillis(10);
+        assertNotEquals("timeout waiting for server manager to become available.", -1,
+          htu.waitFor(timeout, () -> masterThread.getMaster().getServerManager() != null));
+
+        // Master has come up far enough that we can terminate it without creating a zombie.
+        LOG.debug("Attempting to establish connection.");
+        try {
+          // HBASE-24327 : (Resolve Flaky connection issues)
+          // shutdown() RPC can have flaky ZK connection issues.
+          // e.g
+          // ERROR [RpcServer.priority.RWQ.Fifo.read.handler=1,queue=1,port=53033]
+          // master.HMaster(2878): ZooKeeper exception trying to set cluster as down in ZK
+          // org.apache.zookeeper.KeeperException$SystemErrorException:
+          // KeeperErrorCode = SystemError
+          //
+          // However, even when above flakes happen, shutdown call does get completed even if
+          // RPC call has failure. Hence, subsequent retries will never succeed as HMaster is
+          // already shutdown. Hence, it can fail. To resolve it, after making one shutdown()
+          // call, we are ignoring IOException.
+          htu.getConnection().getAdmin().shutdown();
+          LOG.info("Shutdown RPC sent.");
+        } catch (IOException | CompletionException e) {
+          LOG.warn("Failed to establish connection.", e);

Review comment:
   I think the MasterRegistryFetchException is only relevant if we retry
shutdown() RPCs. The first shutdown() throws an IOException (some subclass of
it) but still shuts down the master; the second (and later) calls can't find a
master to send the shutdown to, so the stub fails and throws this
MasterRegistryFetchException.
   
   Now that we have decided to do only a single shutdown, we should see either
the specific exception (a subclass of IOException, when shutdown fails) or no
exception at all (if shutdown succeeds, which is 99% of the test runs). I
think we are good if we catch and log the first exception (I don't remember the
exact checked exception type).
   
   The reason I suggested keeping the checks simple (catch the exception and
log it) is that if there was any issue in the RPC, it would show up anyway as a
failure of the shutdown join: the test fails and we have the exception logged.
We don't need to get fancy and check exception.getCause() recursively.
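   
   To make the single-shutdown argument concrete, here is a minimal, self-contained sketch. It does not use HBase at all: `FakeMaster` is a hypothetical stand-in that mimics the behavior described above (the first call stops the master even when it throws, and any later call finds no master), and `shutdownOnce` is an illustrative name, not something in the PR.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

public class SingleShutdownSketch {

  /** Hypothetical stand-in for the master; NOT the HBase Admin API. */
  static class FakeMaster {
    final AtomicBoolean stopped = new AtomicBoolean(false);
    final boolean flaky;

    FakeMaster(boolean flaky) {
      this.flaky = flaky;
    }

    void shutdown() throws IOException {
      if (stopped.get()) {
        // Models a retry: there is no master left to talk to.
        throw new IOException("no master to talk to");
      }
      // The shutdown takes effect even if the call then reports a failure.
      stopped.set(true);
      if (flaky) {
        throw new IOException("simulated ZooKeeper SystemError");
      }
    }
  }

  /** Issues exactly one shutdown, logs any IOException, and never retries. */
  static boolean shutdownOnce(FakeMaster master) {
    CompletableFuture<Void> shutdownFuture = CompletableFuture.runAsync(() -> {
      try {
        master.shutdown();
        System.out.println("Shutdown RPC sent.");
      } catch (IOException e) {
        // Catch and log only; no recursive getCause() inspection.
        System.out.println("Shutdown RPC failed, not retrying: " + e.getMessage());
      }
    });
    // Any unexpected (unchecked) failure inside the task would surface here.
    shutdownFuture.join();
    return master.stopped.get();
  }

  public static void main(String[] args) {
    System.out.println(shutdownOnce(new FakeMaster(false)));
    System.out.println(shutdownOnce(new FakeMaster(true)));
  }
}
```

   In this model the master ends up stopped in both the clean and the flaky case, so logging the one exception is enough; a second call would only fail for a different reason.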
   
   I think what Nick is suggesting is to catch the specific subclass of
IOException that is thrown if shutdown() fails. @ndimiduk, correct me if I got
you wrong.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
