virajjasani commented on a change in pull request #1690:
URL: https://github.com/apache/hbase/pull/1690#discussion_r424883373
##########
File path:
hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestMasterShutdown.java
##########
@@ -151,19 +152,46 @@ public void testMasterShutdownBeforeStartingAnyRegionServer() throws Exception {
hbaseCluster = new LocalHBaseCluster(htu.getConfiguration(), options.getNumMasters(),
options.getNumRegionServers(), options.getMasterClass(), options.getRsClass());
final MasterThread masterThread = hbaseCluster.getMasters().get(0);
+
masterThread.start();
- // Switching to master registry exacerbated a race in the master bootstrap that can result
- // in a lost shutdown command (HBASE-8422, HBASE-23836). The race is essentially because
- // the server manager in HMaster is not initialized by the time shutdown() RPC (below) is
- // made to the master. The suspected reason as to why it was uncommon before HBASE-18095
- // is because the connection creation with ZK registry is so slow that by then the server
- // manager is usually init'ed in time for the RPC to be made. For now, adding an explicit
- // wait() in the test, waiting for the server manager to become available.
- final long timeout = TimeUnit.MINUTES.toMillis(10);
- assertNotEquals("Timeout waiting for server manager to become available.",
- -1, Waiter.waitFor(htu.getConfiguration(), timeout,
- () -> masterThread.getMaster().getServerManager() != null));
- htu.getConnection().getAdmin().shutdown();
+ final CompletableFuture<Void> shutdownFuture = CompletableFuture.runAsync(() -> {
+ // Switching to master registry exacerbated a race in the master bootstrap that can result
+ // in a lost shutdown command (HBASE-8422, HBASE-23836). The race is essentially because
+ // the server manager in HMaster is not initialized by the time shutdown() RPC (below) is
+ // made to the master. The suspected reason as to why it was uncommon before HBASE-18095
+ // is because the connection creation with ZK registry is so slow that by then the server
+ // manager is usually init'ed in time for the RPC to be made. For now, adding an explicit
+ // wait() in the test, waiting for the server manager to become available.
+ final long timeout = TimeUnit.MINUTES.toMillis(10);
+ assertNotEquals("timeout waiting for server manager to become available.", -1,
+ htu.waitFor(timeout, () -> masterThread.getMaster().getServerManager() != null));
+
+ // Master has come up far enough that we can terminate it without creating a zombie.
+ LOG.debug("Attempting to establish connection.");
+ try {
+ // HBASE-24327 : (Resolve Flaky connection issues)
+ // shutdown() RPC can have flaky ZK connection issues.
+ // e.g
+ // ERROR [RpcServer.priority.RWQ.Fifo.read.handler=1,queue=1,port=53033]
+ // master.HMaster(2878): ZooKeeper exception trying to set cluster as down in ZK
+ // org.apache.zookeeper.KeeperException$SystemErrorException:
+ // KeeperErrorCode = SystemError
+ //
+ // However, even when above flakes happen, shutdown call does get completed even if
+ // RPC call has failure. Hence, subsequent retries will never succeed as HMaster is
+ // already shutdown. Hence, it can fail. To resolve it, after making one shutdown()
+ // call, we are ignoring IOException.
+ htu.getConnection().getAdmin().shutdown();
+ LOG.info("Shutdown RPC sent.");
+ } catch (IOException | CompletionException e) {
+ LOG.warn("Failed to establish connection.", e);
Review comment:
@ndimiduk @bharathv
I will remove `CompletionException` from the catch list. What about
IOException? Would it be better to follow the above code, find the root-cause
message, and handle the exception accordingly?
That is what I had in my first commit, but Bharath and I then concluded that it
was a bit hacky and that we could simply handle IOException without worrying
about the cause. Thinking about it again, you are right: although an IOException
here is unlikely to have a different root cause, it is still better to be
specific. That way it is not too hacky, right?
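
For illustration, a minimal sketch of the "be specific about the root cause" option. It assumes the flaky failure surfaces as an IOException whose cause chain ends in a throwable whose message contains `KeeperErrorCode = SystemError` (the text from the log quoted in the diff); whether the real exception is shaped that way is not confirmed here. The class `ShutdownRootCause`, the interface `ShutdownCall`, and all method names are hypothetical and are not part of HBase or the test utilities.

```java
import java.io.IOException;

// Hypothetical helper, not HBase API: tolerate only the known ZK flake.
final class ShutdownRootCause {

  // Message fragment taken from the ZooKeeper error quoted in the diff.
  static final String EXPECTED_FLAKE = "KeeperErrorCode = SystemError";

  // Hypothetical stand-in for a call such as admin.shutdown().
  interface ShutdownCall {
    void run() throws IOException;
  }

  // Walks the cause chain and returns the innermost throwable.
  static Throwable rootCause(Throwable t) {
    Throwable current = t;
    while (current.getCause() != null && current.getCause() != current) {
      current = current.getCause();
    }
    return current;
  }

  // True only when the root cause carries the expected flake message
  // (assumption: the flake is identifiable by its message text).
  static boolean isExpectedShutdownFlake(IOException e) {
    String message = rootCause(e).getMessage();
    return message != null && message.contains(EXPECTED_FLAKE);
  }

  // Returns true if the call completed cleanly, false if the known flake
  // was swallowed; any other IOException is rethrown to the caller.
  static boolean shutdownToleratingFlake(ShutdownCall call) throws IOException {
    try {
      call.run();
      return true;
    } catch (IOException e) {
      if (isExpectedShutdownFlake(e)) {
        return false;
      }
      throw e;
    }
  }
}
```

In the test this would wrap the single shutdown() call, so an unrelated IOException still fails the test instead of being logged and ignored.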