[
https://issues.apache.org/jira/browse/SENTRY-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063623#comment-16063623
]
Vamsee Yarlagadda commented on SENTRY-1813:
-------------------------------------------
The actual underlying issue is that the Sentry runtime classpath contains multiple
versions of Curator (the client abstraction for ZK).
Sentry tries to pull in Curator 2.11.1, whereas [Hadoop
pulls|https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-project/pom.xml#L76]
in Curator 2.7.1. Version 2.7.1 has known issues in the leader
election process,
e.g. https://issues.apache.org/jira/browse/CURATOR-202
With both versions on the classpath, which jar the JVM loads Curator classes
from at runtime depends on classpath order and is effectively arbitrary.
We should make sure Sentry always picks up the right version of
Curator.
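One way to confirm which jar actually supplied a class at runtime is to ask the
class for its code source. The following is a minimal sketch, not Sentry code:
the class name JarLocator and the method locationOf are hypothetical. In a
Sentry process one would presumably pass
org.apache.curator.framework.CuratorFramework.class, assuming Curator is on the
classpath; the demo below uses its own class so that it runs without Curator.

```java
import java.security.CodeSource;

public class JarLocator {
    /** Returns the location a class was loaded from, or a note if unknown. */
    public static String locationOf(Class<?> clazz) {
        // Classes loaded by the bootstrap loader (e.g. java.lang.String)
        // report no code source.
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown source)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Hypothetical usage: replace JarLocator.class with the Curator class
        // of interest to see which jar (and hence which version) was loaded.
        System.out.println(locationOf(JarLocator.class));
    }
}
```

If the printed path points at the 2.7.1 jar, the older Curator is the one in
effect, regardless of what Sentry's own pom declares.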
> LeaderStatusMonitor could get into limbo state upon ZK connection loss
> ----------------------------------------------------------------------
>
> Key: SENTRY-1813
> URL: https://issues.apache.org/jira/browse/SENTRY-1813
> Project: Sentry
> Issue Type: Bug
> Affects Versions: sentry-ha-redesign
> Reporter: Vamsee Yarlagadda
> Assignee: Vamsee Yarlagadda
> Priority: Critical
> Labels: sentry-ha
> Fix For: sentry-ha-redesign
>
> Attachments: Screenshot.png
>
>
> I noticed during failover testing that if the Sentry servers lose their
> connection to ZK, the server that is currently the leader gets into a
> limbo state, because it interrupts the Curator-LeaderSelector thread, which
> is never revived in the running Sentry process (unless the process is
> restarted).
> Relevant code under LeaderStatusMonitor
> http://github.mtv.cloudera.com/CDH/sentry/blob/cdh5-1.5.1/sentry-provider/sentry-provider-db/src/main/java/org/apache/sentry/service/thrift/LeaderStatusMonitor.java#L243-L246
> {code}
> try {
>   isLeader = true;
>   // Wait until we are interrupted or receive a signal
>   cond.await();
> } catch (InterruptedException ignored) {
>   Thread.currentThread().interrupt();
>   LOG.info("LeaderStatusMonitor: interrupted");
> } finally {
>   isLeader = false;
>   lock.unlock();
>   LOG.info("LeaderStatusMonitor: becoming standby");
> }
> {code}
> I realized that upon loss of the ZK connection, the Curator framework raises an
> InterruptedException in LeaderStatusMonitor, which then calls interrupt()
> on Thread.currentThread(), which is essentially the *Curator-LeaderSelector* thread.
> <SCREENSHOT_ATTACHED>
> So if the LeaderSelector thread is interrupted, this particular Sentry server
> loses the ability to participate in future leader elections. And
> if this happens to all the Sentry servers in the cluster, any further
> connection loss could leave the cluster in a limbo state.
> While in this state, Sentry no longer reads events from HMS, so
> users can no longer issue DDL statements such as CREATE. However,
> GRANT and REVOKE still work because they don't go through HMSFollower.
>
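The interrupt handling quoted above can be reproduced outside Sentry. The
following is a minimal sketch, not the Sentry or Curator implementation: the
class InterruptFlagDemo, the method flagSetAfterInterrupt, and the thread name
are hypothetical. It only shows that re-asserting the interrupt in the catch
block leaves the interrupt flag set on the thread that ran the wait; it does
not show how Curator reacts to that flag.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class InterruptFlagDemo {
    /**
     * Runs a takeLeadership-style wait on a worker thread, interrupts it, and
     * reports whether the worker's interrupt flag is still set afterwards.
     */
    public static boolean flagSetAfterInterrupt() {
        ReentrantLock lock = new ReentrantLock();
        Condition cond = lock.newCondition();
        CountDownLatch holdingLock = new CountDownLatch(1);
        AtomicBoolean flagAfter = new AtomicBoolean(false);

        Thread worker = new Thread(() -> {
            lock.lock();
            try {
                holdingLock.countDown();
                // Wait until we are interrupted or receive a signal
                cond.await();
            } catch (InterruptedException ignored) {
                // Same pattern as the quoted code: restore the interrupt flag
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
            // Record the flag as seen by whatever code runs next on this thread
            flagAfter.set(Thread.currentThread().isInterrupted());
        }, "leader-selector-demo");

        try {
            worker.start();
            holdingLock.await();   // worker holds the lock and is about to wait
            worker.interrupt();    // stands in for the interrupt on ZK connection loss
            worker.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException("demo itself was interrupted", e);
        }
        return flagAfter.get();
    }

    public static void main(String[] args) {
        System.out.println("interrupt flag still set: " + flagSetAfterInterrupt());
    }
}
```

With the flag still set, the next blocking call made on that thread would throw
InterruptedException immediately, which is consistent with the thread not
resuming normal work after the connection loss.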