ndimiduk commented on a change in pull request #1690:
URL: https://github.com/apache/hbase/pull/1690#discussion_r424789318



##########
File path: 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterShutdown.java
##########
@@ -151,19 +152,46 @@ public void 
testMasterShutdownBeforeStartingAnyRegionServer() throws Exception {
       hbaseCluster = new LocalHBaseCluster(htu.getConfiguration(), 
options.getNumMasters(),
         options.getNumRegionServers(), options.getMasterClass(), 
options.getRsClass());
       final MasterThread masterThread = hbaseCluster.getMasters().get(0);
+
       masterThread.start();
-      // Switching to master registry exacerbated a race in the master 
bootstrap that can result
-      // in a lost shutdown command (HBASE-8422, HBASE-23836). The race is 
essentially because
-      // the server manager in HMaster is not initialized by the time 
shutdown() RPC (below) is
-      // made to the master. The suspected reason as to why it was uncommon 
before HBASE-18095
-      // is because the connection creation with ZK registry is so slow that 
by then the server
-      // manager is usually init'ed in time for the RPC to be made. For now, 
adding an explicit
-      // wait() in the test, waiting for the server manager to become 
available.
-      final long timeout = TimeUnit.MINUTES.toMillis(10);
-      assertNotEquals("Timeout waiting for server manager to become 
available.",
-        -1, Waiter.waitFor(htu.getConfiguration(), timeout,
-          () -> masterThread.getMaster().getServerManager() != null));
-      htu.getConnection().getAdmin().shutdown();
+      final CompletableFuture<Void> shutdownFuture = 
CompletableFuture.runAsync(() -> {

Review comment:
       Ah, and that one smells a bit like HBASE-23836.

##########
File path: 
hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterShutdown.java
##########
@@ -151,19 +152,46 @@ public void 
testMasterShutdownBeforeStartingAnyRegionServer() throws Exception {
       hbaseCluster = new LocalHBaseCluster(htu.getConfiguration(), 
options.getNumMasters(),
         options.getNumRegionServers(), options.getMasterClass(), 
options.getRsClass());
       final MasterThread masterThread = hbaseCluster.getMasters().get(0);
+
       masterThread.start();
-      // Switching to master registry exacerbated a race in the master 
bootstrap that can result
-      // in a lost shutdown command (HBASE-8422, HBASE-23836). The race is 
essentially because
-      // the server manager in HMaster is not initialized by the time 
shutdown() RPC (below) is
-      // made to the master. The suspected reason as to why it was uncommon 
before HBASE-18095
-      // is because the connection creation with ZK registry is so slow that 
by then the server
-      // manager is usually init'ed in time for the RPC to be made. For now, 
adding an explicit
-      // wait() in the test, waiting for the server manager to become 
available.
-      final long timeout = TimeUnit.MINUTES.toMillis(10);
-      assertNotEquals("Timeout waiting for server manager to become 
available.",
-        -1, Waiter.waitFor(htu.getConfiguration(), timeout,
-          () -> masterThread.getMaster().getServerManager() != null));
-      htu.getConnection().getAdmin().shutdown();
+      final CompletableFuture<Void> shutdownFuture = 
CompletableFuture.runAsync(() -> {
+        // Switching to master registry exacerbated a race in the master 
bootstrap that can result
+        // in a lost shutdown command (HBASE-8422, HBASE-23836). The race is 
essentially because
+        // the server manager in HMaster is not initialized by the time 
shutdown() RPC (below) is
+        // made to the master. The suspected reason as to why it was uncommon 
before HBASE-18095
+        // is because the connection creation with ZK registry is so slow that 
by then the server
+        // manager is usually init'ed in time for the RPC to be made. For now, 
adding an explicit
+        // wait() in the test, waiting for the server manager to become 
available.
+        final long timeout = TimeUnit.MINUTES.toMillis(10);
+        assertNotEquals("timeout waiting for server manager to become 
available.", -1,
+          htu.waitFor(timeout, () -> 
masterThread.getMaster().getServerManager() != null));
+
+        // Master has come up far enough that we can terminate it without 
creating a zombie.
+        LOG.debug("Attempting to establish connection.");
+        try {
+          // HBASE-24327 : (Resolve Flaky connection issues)
+          // shutdown() RPC can have flaky ZK connection issues.
+          // e.g
+          // ERROR 
[RpcServer.priority.RWQ.Fifo.read.handler=1,queue=1,port=53033]
+          // master.HMaster(2878): ZooKeeper exception trying to set cluster 
as down in ZK
+          // org.apache.zookeeper.KeeperException$SystemErrorException:
+          // KeeperErrorCode = SystemError
+          //
+          // However, even when above flakes happen, shutdown call does get 
completed even if
+          // RPC call has failure. Hence, subsequent retries will never 
succeed as HMaster is
+          // already shutdown. Hence, it can fail. To resolve it, after making 
one shutdown()
+          // call, we are ignoring IOException.
+          htu.getConnection().getAdmin().shutdown();
+          LOG.info("Shutdown RPC sent.");
+        } catch (IOException | CompletionException e) {
+          LOG.warn("Failed to establish connection.", e);

Review comment:
       Thanks for the links to the previous commits. GH makes it difficult for 
me to follow previous conversation.
   
   Okay, so if you move this code out of the `CompletableFuture`, the 
`CompletionException` can just be re-thrown. It indicates a client-side 
problem, and there's nothing the test should try to recover.
   
   I would not catch `IOException` either, just let it bubble up. Instead, I 
would focus on meaningful subclasses of `IOException`. 
`RetriesExhaustedException` is a good place to work from: you know your client 
made some effort. There's potentially multiple cause instances under there, so 
i guess just pick one to work with, probably the last one. If it's a descendent 
of `java.net.SocketException`, you know the client couldn't get anywhere -- how 
should the test behave if the the master is not running?
   
   After that, I'm not really sure what's thrown in what cases. Our API's 
checked exception definitions aren't strong enough for me know by reading the 
interfaces. However, pretty much anything else means the client managed to get 
the RPC over to the server. I think that fact alone is enough to consider this 
part of the test has succeeded at its goal. Until the myriad other issues in 
master startup and shutdown are resolved, I think this is the best the client 
can hope for (for what it's worth, I think this test will continue to be flakey 
until those master-side problems are solved, and they cannot be resolved 
perfectly by client-side gymnastics).
   
   You've looked at it, and seen the errors and test failures, more recently 
than I have, so what do you think? What brought you to this ticket in the first 
place? Are there more specific subclasses of `IOException` that are thrown, 
which you can use to reasonably address that specific condition? Paste the 
stack traces into Jira and the commit message so we can follow your effort.
   
   And thank you for your effort -- test fixing is usually thankless drudgery 
but it makes all of our lives better :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to