[
https://issues.apache.org/jira/browse/SENTRY-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061125#comment-16061125
]
Alexander Kolbasov commented on SENTRY-1813:
--------------------------------------------
Interesting. I got this from the official Apache Curator example at
https://git-wip-us.apache.org/repos/asf?p=curator.git;a=blob;f=curator-examples/src/main/java/leader/ExampleClient.java;h=6ec4a1f9d4a11fd52a44eafaca46b8ae9f9b40c4;hb=HEAD:
{code}
@Override
public void takeLeadership(CuratorFramework client) throws Exception
{
    // we are now the leader. This method should not return until we want to relinquish leadership

    final int waitSeconds = (int)(5 * Math.random()) + 1;

    System.out.println(name + " is now the leader. Waiting " + waitSeconds + " seconds...");
    System.out.println(name + " has been leader " + leaderCount.getAndIncrement() + " time(s) before.");
    try
    {
        Thread.sleep(TimeUnit.SECONDS.toMillis(waitSeconds));
    }
    catch ( InterruptedException e )
    {
        System.err.println(name + " was interrupted.");
        Thread.currentThread().interrupt();
    }
    finally
    {
        System.out.println(name + " relinquishing leadership.\n");
    }
}
{code}
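For reference, here is a minimal standalone sketch of what the catch block in that example does to the thread's interrupt status. It uses only plain Java threads, with no Curator or Sentry code; the class, method, and thread names (InterruptFlagDemo, sleepAndRestoreInterrupt, runDemo, "hypothetical-selector-thread") are made up for illustration. It shows that once the flag is restored with Thread.currentThread().interrupt(), it stays set, so the next blocking call on the same thread fails immediately.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of Curator or Sentry.
final class InterruptFlagDemo {

    // Same pattern as the example's catch block: handle the exception,
    // then restore the interrupt status. Returns true if interrupted.
    static boolean sleepAndRestoreInterrupt(long millis) {
        try {
            Thread.sleep(millis);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return true;
        }
    }

    // Runs the scenario on a worker thread and reports:
    // [0] first sleep was interrupted
    // [1] interrupt flag is still set afterwards
    // [2] second sleep was interrupted as well, with no new interrupt() call
    static boolean[] runDemo() {
        final boolean[] result = new boolean[3];
        Thread worker = new Thread(() -> {
            result[0] = sleepAndRestoreInterrupt(TimeUnit.SECONDS.toMillis(30));
            result[1] = Thread.currentThread().isInterrupted();
            result[2] = sleepAndRestoreInterrupt(TimeUnit.SECONDS.toMillis(30));
        }, "hypothetical-selector-thread");
        worker.start();
        worker.interrupt(); // a single interrupt from outside
        try {
            worker.join();  // join makes the worker's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo itself was interrupted", e);
        }
        return result;
    }

    public static void main(String[] args) {
        boolean[] r = runDemo();
        System.out.println("first sleep interrupted:  " + r[0]);
        System.out.println("flag still set afterwards: " + r[1]);
        System.out.println("second sleep interrupted: " + r[2]);
    }
}
```

This only demonstrates standard Java interrupt semantics. Whether it matches what happens on the Curator-LeaderSelector thread depends on how Curator treats the flag after takeLeadership() returns, which the sketch does not model.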
> LeaderStatusMonitor could get into limbo state upon ZK connection loss
> ----------------------------------------------------------------------
>
> Key: SENTRY-1813
> URL: https://issues.apache.org/jira/browse/SENTRY-1813
> Project: Sentry
> Issue Type: Bug
> Affects Versions: sentry-ha-redesign
> Reporter: Vamsee Yarlagadda
> Assignee: Alexander Kolbasov
> Priority: Critical
> Labels: sentry-ha
> Fix For: sentry-ha-redesign
>
> Attachments: Screenshot.png
>
>
> During failover testing I noticed that if the Sentry servers lose their
> connection to ZK, the server that is currently the leader gets into a limbo
> state: it interrupts the Curator-LeaderSelector thread, which is never
> revived in the running Sentry process (unless the process is restarted).
> Relevant code under LeaderStatusMonitor
> http://github.mtv.cloudera.com/CDH/sentry/blob/cdh5-1.5.1/sentry-provider/sentry-provider-db/src/main/java/org/apache/sentry/service/thrift/LeaderStatusMonitor.java#L243-L246
> {code}
> try {
>   isLeader = true;
>   // Wait until we are interrupted or receive a signal
>   cond.await();
> } catch (InterruptedException ignored) {
>   Thread.currentThread().interrupt();
>   LOG.info("LeaderStatusMonitor: interrupted");
> } finally {
>   isLeader = false;
>   lock.unlock();
>   LOG.info("LeaderStatusMonitor: becoming standby");
> }
> {code}
> I realized that upon loss of the ZK connection, the Curator framework raises
> an InterruptedException in LeaderStatusMonitor, which then calls interrupt()
> on Thread.currentThread(), which is essentially the *Curator-LeaderSelector*
> thread.
> <SCREENSHOT_ATTACHED>
> So if the LeaderSelector thread is interrupted, this particular Sentry server
> loses the ability to participate in any future leader election. And if this
> happens to all the Sentry servers in the cluster, a further connection loss
> could leave the cluster in a limbo state.
> In this state, Sentry no longer reads events from HMS, so users are no longer
> able to issue DDL statements such as CREATE. However, GRANT and REVOKE still
> work because they don't go through HMSFollower.
>