Vamsee Yarlagadda created SENTRY-1813:
-----------------------------------------

             Summary: LeaderStatusMonitor could get into limbo state upon ZK 
connection loss
                 Key: SENTRY-1813
                 URL: https://issues.apache.org/jira/browse/SENTRY-1813
             Project: Sentry
          Issue Type: Bug
    Affects Versions: sentry-ha-redesign
            Reporter: Vamsee Yarlagadda
            Priority: Critical
             Fix For: sentry-ha-redesign


During failover testing I noticed that if the Sentry servers lose their 
connection to ZK, the server that is currently the leader gets into a limbo 
state: it interrupts the Curator-LeaderSelector thread, which is never revived 
in the running Sentry process (unless the process is restarted).

Relevant code in LeaderStatusMonitor:
http://github.mtv.cloudera.com/CDH/sentry/blob/cdh5-1.5.1/sentry-provider/sentry-provider-db/src/main/java/org/apache/sentry/service/thrift/LeaderStatusMonitor.java#L243-L246
{code}
    try {
      isLeader = true;
      // Wait until we are interrupted or receive a signal
      cond.await();
    } catch (InterruptedException ignored) {
      Thread.currentThread().interrupt();
      LOG.info("LeaderStatusMonitor: interrupted");
    } finally {
      isLeader = false;
      lock.unlock();
      LOG.info("LeaderStatusMonitor: becoming standby");
    }
{code}

I realized that upon loss of the ZK connection, the Curator framework raises an 
InterruptedException in LeaderStatusMonitor, whose handler then calls 
interrupt() on Thread.currentThread(), which is essentially the 
*Curator-LeaderSelector* thread.
<SCREENSHOT_ATTACHED>

So once the LeaderSelector thread is interrupted, this particular Sentry server 
loses the ability to participate in any future leader election. If this happens 
to all the Sentry servers in the cluster, any further connection loss could 
leave the cluster in a limbo state.

While in this state, Sentry no longer reads events from HMS, so users can no 
longer issue DDL statements such as CREATE. However, GRANT and REVOKE still 
work as they don't go through HMSFollower.
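
To illustrate the interrupt handling described above, here is a minimal 
standalone Java sketch. It is not Sentry or Curator code: the class name, 
method name, and thread name are hypothetical, and it only shows the general 
JVM behavior that re-asserting the interrupt flag in the catch block causes 
the next blocking call on the same thread to fail immediately. Whether this 
is exactly what keeps the Curator-LeaderSelector thread from recovering is an 
assumption that still needs to be confirmed against Curator itself.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo class; not part of Sentry or Curator.
public class InterruptFlagDemo {

    /**
     * Mimics the quoted try/catch/finally pattern on a worker thread and
     * returns true if a later blocking call on that same thread is
     * interrupted right away because the flag was re-asserted.
     */
    public static boolean laterBlockingCallIsInterrupted() {
        ReentrantLock lock = new ReentrantLock();
        Condition cond = lock.newCondition();
        CountDownLatch started = new CountDownLatch(1);
        AtomicBoolean laterCallInterrupted = new AtomicBoolean(false);

        Thread worker = new Thread(() -> {
            lock.lock();
            try {
                started.countDown();
                // Wait until we are interrupted or receive a signal
                cond.await();
            } catch (InterruptedException e) {
                // Same handling as in the quoted snippet: restore the flag
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
            // Any later blocking call on this thread now fails immediately,
            // because the interrupt flag is still set.
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                laterCallInterrupted.set(true);
            }
        }, "demo-leader-selector");

        try {
            worker.start();
            started.await();
            worker.interrupt();   // stands in for the ZK connection loss
            worker.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException("demo driver was interrupted", e);
        }
        return laterCallInterrupted.get();
    }

    public static void main(String[] args) {
        System.out.println("later blocking call interrupted: "
                + laterBlockingCallIsInterrupted());
    }
}
```

In the sketch the flag stays set until some blocking call consumes it, so any 
work scheduled on that thread afterwards inherits the interrupted status.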
  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
