[ 
https://issues.apache.org/jira/browse/ACCUMULO-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15478595#comment-15478595
 ] 

Josh Elser commented on ACCUMULO-4424:
--------------------------------------

The general approach here is to start the Thrift servers for the Master and the 
HTTP server for the Monitor, and then block on obtaining the ZooKeeper lock.

The trick here is that we don't want to process any RPCs until the lock is 
acquired. I have trivially done this with an InvocationHandler around the 
Thrift Iface for the Master, and with a quick check in the Monitor servlets.
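
A minimal sketch of what an InvocationHandler guard like the one described 
above could look like. This is not the actual patch: {{PingService}} is a 
hypothetical stand-in for a generated Thrift Iface, {{LockGuardHandler}} is an 
illustrative name, and the lock check is passed in as a plain BooleanSupplier 
rather than tied to any Accumulo class.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a generated Thrift Iface (assumption, not Accumulo code). */
interface PingService {
  String ping(String msg);
}

/** Rejects every call on the wrapped object until the lock check reports true. */
final class LockGuardHandler implements InvocationHandler {
  private final Object delegate;
  private final BooleanSupplier lockHeld;

  LockGuardHandler(Object delegate, BooleanSupplier lockHeld) {
    this.delegate = delegate;
    this.lockHeld = lockHeld;
  }

  @Override
  public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
    // Note: Object methods such as toString() are also routed through here.
    if (!lockHeld.getAsBoolean()) {
      throw new IllegalStateException(
          "Lock not yet acquired; rejecting call to " + method.getName());
    }
    try {
      return method.invoke(delegate, args);
    } catch (InvocationTargetException e) {
      // Unwrap so callers see the exception the real implementation threw.
      throw e.getCause();
    }
  }

  /** Wraps an implementation of iface in a dynamic proxy that applies the guard. */
  static <T> T wrap(Class<T> iface, T delegate, BooleanSupplier lockHeld) {
    Object proxy = Proxy.newProxyInstance(
        iface.getClassLoader(),
        new Class<?>[] {iface},
        new LockGuardHandler(delegate, lockHeld));
    return iface.cast(proxy);
  }
}
```

The server would be handed the object returned by {{wrap(...)}} instead of the 
real implementation, so calls fail fast until the lock check starts returning 
true and pass straight through afterwards.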

It turns out that the GC had already been doing this. We don't care about 
protecting its RPC server since it only serves metrics.

One concern I have is that the {{ZooLock.isLockHeld()}} method being invoked 
is synchronized. This would mean that for every RPC the Master receives, we 
would acquire the ZooLock object's monitor before actually processing the RPC. 
I need to dig a little and see if this is actually going to be an issue...
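
If the synchronized check does turn out to matter, one possible way around it 
is to keep the lock state in an atomic flag that the RPC path reads without 
taking a monitor. The sketch below is only an illustration of that idea: 
{{LockState}} and its methods are hypothetical names, not part of ZooLock, and 
it assumes something else (for example the lock's acquire/lost callbacks) calls 
{{onAcquired()}} and {{onLost()}} at the right times.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical holder for the lock state; not part of Accumulo's ZooLock. */
final class LockState {
  // Reads of an AtomicBoolean do not block and need no synchronized section.
  private final AtomicBoolean held = new AtomicBoolean(false);

  /** Assumed to be called once the ZooKeeper lock has been obtained. */
  void onAcquired() {
    held.set(true);
  }

  /** Assumed to be called if the lock is lost. */
  void onLost() {
    held.set(false);
  }

  /** Cheap check intended for the per-RPC path. */
  boolean isHeld() {
    return held.get();
  }
}
```

The trade-off is that the flag is only as fresh as the last callback, so this 
is a sketch of the idea rather than a drop-in replacement for the real check.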

> Do not wait to start Thrift servers until lock is acquired
> ----------------------------------------------------------
>
>                 Key: ACCUMULO-4424
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4424
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: rpc
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 2.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Had an Accumulo + Ambari user report a funny issue: 
> https://community.hortonworks.com/questions/53203/ambari-is-showing-alerts-on-the-accumulo-service-e.html
> When starting multiple masters, monitors, and GCs, they observed that, 
> despite Accumulo being healthy, Ambari kept reporting that two-thirds of the 
> instances of each service were down. This is because Ambari's service check 
> expects the Thrift server to be up.
> Presently, for services where only one active instance is allowed, we do not 
> put up the Thrift server until we acquire the leader ZK lock. I propose that 
> we still start these servers but introduce a barrier to prevent any API calls 
> from succeeding until the leader lock is obtained. This has a couple of 
> benefits:
> * Better "health" check -- processes might be zombie'd, pidfile check would 
> be insufficient
> * Less confusion around a process that is running but not binding the port 
> (have personally dealt with a case where a user was confused and thought the 
> services were incorrectly stuck on startup)
> I believe this would also be pretty simple to do since the leader election is 
> already implemented in one place (just the znode differs).



