[ 
https://issues.apache.org/jira/browse/SENTRY-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061125#comment-16061125
 ] 

Alexander Kolbasov commented on SENTRY-1813:
--------------------------------------------

Interesting. I got this from the official Apache Curator example at 
https://git-wip-us.apache.org/repos/asf?p=curator.git;a=blob;f=curator-examples/src/main/java/leader/ExampleClient.java;h=6ec4a1f9d4a11fd52a44eafaca46b8ae9f9b40c4;hb=HEAD:

{code}
@Override
public void takeLeadership(CuratorFramework client) throws Exception
{
    // we are now the leader. This method should not return until we want to relinquish leadership

    final int         waitSeconds = (int)(5 * Math.random()) + 1;

    System.out.println(name + " is now the leader. Waiting " + waitSeconds + " seconds...");
    System.out.println(name + " has been leader " + leaderCount.getAndIncrement() + " time(s) before.");
    try
    {
        Thread.sleep(TimeUnit.SECONDS.toMillis(waitSeconds));
    }
    catch ( InterruptedException e )
    {
        System.err.println(name + " was interrupted.");
        Thread.currentThread().interrupt();
    }
    finally
    {
        System.out.println(name + " relinquishing leadership.\n");
    }
}
{code}

> LeaderStatusMonitor could get into limbo state upon ZK connection loss
> ----------------------------------------------------------------------
>
>                 Key: SENTRY-1813
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1813
>             Project: Sentry
>          Issue Type: Bug
>    Affects Versions: sentry-ha-redesign
>            Reporter: Vamsee Yarlagadda
>            Assignee: Alexander Kolbasov
>            Priority: Critical
>              Labels: sentry-ha
>             Fix For: sentry-ha-redesign
>
>         Attachments: Screenshot.png
>
>
> During failover testing I noticed that if the Sentry servers lose their ZK 
> connection, the server that is currently the leader gets into a limbo 
> state: it interrupts the Curator-LeaderSelector thread, which is never 
> revived in the running Sentry process (unless the process is restarted).
> Relevant code under LeaderStatusMonitor
> http://github.mtv.cloudera.com/CDH/sentry/blob/cdh5-1.5.1/sentry-provider/sentry-provider-db/src/main/java/org/apache/sentry/service/thrift/LeaderStatusMonitor.java#L243-L246
> {code}
>    try {
>       isLeader = true;
>       // Wait until we are interrupted or receive a signal
>       cond.await();
>     } catch (InterruptedException ignored) {
>       Thread.currentThread().interrupt();
>       LOG.info("LeaderStatusMonitor: interrupted");
>     } finally {
>       isLeader = false;
>       lock.unlock();
>       LOG.info("LeaderStatusMonitor: becoming standby");
>     }
> {code}
> I realized that when the ZK connection is lost, the Curator framework raises 
> an InterruptedException in LeaderStatusMonitor, which then calls interrupt() 
> on Thread.currentThread(), which is essentially the *Curator-LeaderSelector* 
> thread.
> <SCREENSHOT_ATTACHED>
> So if the LeaderSelector thread is interrupted, this particular Sentry server 
> can no longer participate in future leader elections. If this happens to all 
> the Sentry servers in the cluster, any further connection loss could leave 
> the cluster in a limbo state.
> In this state, Sentry no longer reads events from HMS, so users can no 
> longer issue DDL statements such as CREATE. However, GRANT and REVOKE still 
> work because they don't go through HMSFollower.
>   
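
For anyone following the thread, here is a minimal sketch in plain Java of what the re-interrupt in the catch block does to the thread that runs it. It uses neither Curator nor Sentry. The class name InterruptFlagDemo, the method nextBlockingCallIsInterrupted, and the thread name are made up for illustration, and the CountDownLatch only stands in for cond.await(). It shows JDK behaviour only: once the flag is restored, the next blocking call on that thread throws InterruptedException immediately. Whether that is what actually stalls the Curator-LeaderSelector thread is an assumption this sketch does not prove.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

final class InterruptFlagDemo {

    /**
     * Interrupts a worker while it is blocked, lets it handle the
     * InterruptedException either by restoring the interrupt flag or by
     * swallowing it, and reports whether the worker's NEXT blocking call
     * was interrupted as well.
     */
    static boolean nextBlockingCallIsInterrupted(boolean restoreFlag) {
        AtomicBoolean nextCallInterrupted = new AtomicBoolean(false);
        CountDownLatch neverReleased = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            try {
                // Hypothetical stand-in for cond.await() in takeLeadership().
                neverReleased.await();
            } catch (InterruptedException e) {
                if (restoreFlag) {
                    // Same pattern as in the quoted snippets.
                    Thread.currentThread().interrupt();
                }
            }
            try {
                // Stand-in for whatever blocking call the thread makes next.
                Thread.sleep(10);
            } catch (InterruptedException e) {
                nextCallInterrupted.set(true);
            }
        }, "demo-leader-selector");

        worker.start();
        worker.interrupt();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo driver was interrupted", e);
        }
        return nextCallInterrupted.get();
    }

    public static void main(String[] args) {
        System.out.println("restore flag: next blocking call interrupted = "
                + nextBlockingCallIsInterrupted(true));
        System.out.println("swallow:      next blocking call interrupted = "
                + nextBlockingCallIsInterrupted(false));
    }
}
```

Running main prints true for the restore case and false for the swallow case. Restoring the flag is the usual way to let the caller see the interruption, so the sketch is not an argument that the pattern is wrong in general, only a reminder that the flag stays set for whatever code runs next on the same thread.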



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
