[jira] [Commented] (SENTRY-1813) LeaderStatusMonitor could get into limbo state upon ZK connection loss

Vamsee Yarlagadda (JIRA) Mon, 26 Jun 2017 12:07:38 -0700

    [ 
https://issues.apache.org/jira/browse/SENTRY-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063623#comment-16063623
 ]


Vamsee Yarlagadda commented on SENTRY-1813:
-------------------------------------------

The actual underlying issue is that Sentry's runtime classpath contains 
multiple versions of Curator (the client abstraction for ZK). 
Sentry tries to pull in Curator 2.11.1, whereas [Hadoop 
pulls|https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-project/pom.xml#L76]
 in Curator 2.7.1. Version 2.7.1 has known issues in the leader election 
process, e.g. https://issues.apache.org/jira/browse/CURATOR-202
With both versions on the classpath, the JVM may load classes from either 
jar at runtime, depending on classpath order, so the version actually used 
is effectively arbitrary.

Ideally, we should make sure Sentry always picks up the right version of 
Curator.
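
One way to confirm the duplicate-jar situation on a running classpath is to 
ask the class loader for every location that provides a given Curator class. 
The following is only a minimal sketch: the class name ClasspathCheck and the 
helper findCopies are hypothetical names made up for illustration (they are 
not part of Sentry or Curator), and the default class checked, 
org.apache.curator.framework.recipes.leader.LeaderSelector, is an assumption 
about which class is most relevant here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class ClasspathCheck {

    /** Returns every classpath location that provides the given class (hypothetical helper). */
    static List<URL> findCopies(String className) {
        // A class is stored as a resource named after its fully qualified name.
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            // getResources returns all matches, not just the first one found.
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: LeaderSelector is the class of interest for this issue.
        String name = args.length > 0
                ? args[0]
                : "org.apache.curator.framework.recipes.leader.LeaderSelector";
        List<URL> copies = findCopies(name);
        System.out.println(name + " is provided by " + copies.size() + " location(s):");
        for (URL url : copies) {
            System.out.println("  " + url);
        }
    }
}
```

If this is run with the same classpath as the Sentry server and reports more 
than one location, the jar names in the printed URLs should show which 
Curator versions are competing.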

> LeaderStatusMonitor could get into limbo state upon ZK connection loss
> ----------------------------------------------------------------------
>
>                 Key: SENTRY-1813
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1813
>             Project: Sentry
>          Issue Type: Bug
>    Affects Versions: sentry-ha-redesign
>            Reporter: Vamsee Yarlagadda
>            Assignee: Vamsee Yarlagadda
>            Priority: Critical
>              Labels: sentry-ha
>             Fix For: sentry-ha-redesign
>
>         Attachments: Screenshot.png
>
>
> During failover testing I noticed that if the Sentry servers lose their 
> connection to ZK, the server that is currently the leader gets into a 
> limbo state: it interrupts the Curator-LeaderSelector thread, which is 
> never revived in the running Sentry process (unless the process is 
> restarted).
> Relevant code under LeaderStatusMonitor:
> http://github.mtv.cloudera.com/CDH/sentry/blob/cdh5-1.5.1/sentry-provider/sentry-provider-db/src/main/java/org/apache/sentry/service/thrift/LeaderStatusMonitor.java#L243-L246
> {code}
>     try {
>       isLeader = true;
>       // Wait until we are interrupted or receive a signal
>       cond.await();
>     } catch (InterruptedException ignored) {
>       Thread.currentThread().interrupt();
>       LOG.info("LeaderStatusMonitor: interrupted");
>     } finally {
>       isLeader = false;
>       lock.unlock();
>       LOG.info("LeaderStatusMonitor: becoming standby");
>     }
> {code}
> I realized that when the ZK connection is lost, the Curator framework 
> raises an InterruptedException in LeaderStatusMonitor, which then calls 
> interrupt() on Thread.currentThread(), which is the 
> *Curator-LeaderSelector* thread.
> <SCREENSHOT_ATTACHED>
> So if the LeaderSelector thread is interrupted, this particular Sentry 
> server loses the ability to participate in future leader elections. If this 
> happens to all the Sentry servers in the cluster, any further connection 
> loss could leave the cluster in a limbo state.
> While in this state, Sentry no longer reads events from HMS, so users can 
> no longer issue DDL statements such as CREATE. However, GRANT and REVOKE 
> still work because they do not go through HMSFollower.
>   
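
The interrupt behaviour described in the quoted issue can be reproduced 
without Curator or Sentry. The sketch below is a stand-alone illustration, not 
the LeaderStatusMonitor code: InterruptDemo, flagSetAfterWait and the thread 
name are hypothetical, and the worker only mimics the await / catch / 
re-interrupt pattern from the quoted block. It shows that after the catch 
block calls Thread.currentThread().interrupt(), the thread leaves the wait 
with its interrupt flag still set.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class InterruptDemo {

    /**
     * Starts a worker that waits on a condition, interrupts it, and reports
     * whether the worker's interrupt flag was still set after the wait ended.
     */
    static boolean flagSetAfterWait() {
        ReentrantLock lock = new ReentrantLock();
        Condition cond = lock.newCondition();
        CountDownLatch holdingLock = new CountDownLatch(1);
        AtomicBoolean flagStillSet = new AtomicBoolean(false);

        Thread worker = new Thread(() -> {
            lock.lock();
            try {
                holdingLock.countDown();
                // Wait until we are interrupted or receive a signal.
                cond.await();
            } catch (InterruptedException e) {
                // Same pattern as the quoted block: re-assert the interrupt.
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
            // Record the flag as the code that called us would observe it.
            flagStillSet.set(Thread.currentThread().isInterrupted());
        }, "demo-leader-selector");

        worker.start();
        try {
            holdingLock.await();   // worker holds the lock and is about to wait
            worker.interrupt();    // stands in for the interrupt on connection loss
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return flagStillSet.get();
    }

    public static void main(String[] args) {
        System.out.println("interrupt flag set after wait: " + flagSetAfterWait());
    }
}
```

Because the flag stays set, the next blocking call made on that thread would 
throw InterruptedException immediately. Whether that is exactly what prevents 
the Curator-LeaderSelector thread from rejoining the election depends on the 
Curator version in use, so treat this as an aid to reasoning rather than a 
diagnosis.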



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
