[ 
https://issues.apache.org/jira/browse/SOLR-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938539#comment-13938539
 ] 

Mark Miller commented on SOLR-5872:
-----------------------------------

bq. With the overseer queues, each state update is 4+ zookeeper writes

Given the numbers I've seen published for ZK performance, it seems like that 
should not be a big deal in typical cases?

bq. Empirically, we have definitely seen the workqueue back up with lots of 
items during a node bounce

I'm not surprised - most of this code has not been optimized or investigated 
thoroughly. The original author of a lot of the Overseer code has moved on and 
it likely has not seen as much attention as would be nice over the past year. 
Until someone looks into the current issues closely though, it seems hard to 
recommend rewriting this whole very important piece.

bq. If batching really is so important, there's no batching for external 
collection state updates.

I'm not really fully up on "external collections" but AFAIK it's part of some 
other work to support tons of collections that I'm not fully sold on yet either 
:)

bq. In a "normal" rolling bounce where instances are restarted one-by-one, in 
the same order each time, the Overseer is killed at each instance restart, thus 
hindering the recovery process by gating state transition.

This points out another issue that we might be able to address.

Without having looked closely at the issues brought up (and I don't see 
evidence anyone else has either), it's hard to conclude yet that the whole 
thing has to be replaced.

A couple issues around the old implementation:

* With every node updating the whole cluster state on state change, the 
clusterstate.json file is read far too much. The workaround you guys are 
proposing for that appears to be having clients update the clusterstate only 
when they run into an error, but I'm not sold that this is the best 
architecture for the future either. That's a complicated change to make, with 
many ramifications for future development.

* Some things that are in the clusterstate now, or that could be in the future, 
are not so easily handled with the non-Overseer strategy - like marking who is 
the leader. You have to have the Overseer running its own special thread to 
inject and remove information.

* As things are, on something like cluster startup, there will be tons of reads 
and writes of the clusterstate.json - a flood of attempts and retries to update 
it in ZooKeeper.

There should be further background and discussion on that change if you 
search the archives.

There is a strong argument to be made that we should first investigate the 
performance issues with the current strategy. ZooKeeper is pretty fast - these 
state updates are tiny and batched. It seems like we should be able to do a lot 
better without throwing out code that has been getting hardened for a long time 
now.
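
To make the batching point concrete, here is a minimal sketch of what 
batching buys: many tiny state updates are drained from a queue, applied in 
order to an in-memory state, and published with a single write. This is not 
the Overseer code. BatchingSketch, StateUpdate, processBatch and publish are 
hypothetical names, and a local LinkedBlockingQueue stands in for the queue 
that really lives in ZooKeeper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class BatchingSketch {
    /** One tiny state update: replica name -> new state (hypothetical shape). */
    static final class StateUpdate {
        final String replica;
        final String state;
        StateUpdate(String replica, String state) {
            this.replica = replica;
            this.state = state;
        }
    }

    /** Counts publishes; each one stands in for a write of clusterstate.json. */
    static int writes = 0;
    static Map<String, String> published = new TreeMap<>();

    static void publish(Map<String, String> state) {
        writes++;
        published = new TreeMap<>(state);
    }

    /** Drains whatever is queued right now, applies it in order, publishes once. */
    static int processBatch(BlockingQueue<StateUpdate> queue, Map<String, String> state) {
        List<StateUpdate> batch = new ArrayList<>();
        queue.drainTo(batch);
        for (StateUpdate u : batch) {
            state.put(u.replica, u.state);
        }
        if (!batch.isEmpty()) {
            publish(state);
        }
        return batch.size();
    }

    public static void main(String[] args) {
        BlockingQueue<StateUpdate> queue = new LinkedBlockingQueue<>();
        for (String replica : new String[] {"core1", "core2", "core3"}) {
            queue.add(new StateUpdate(replica, "recovering"));
            queue.add(new StateUpdate(replica, "active"));
        }
        Map<String, String> state = new TreeMap<>();
        int applied = processBatch(queue, state);
        // Prints: 6 updates, 1 write: {core1=active, core2=active, core3=active}
        System.out.println(applied + " updates, " + writes + " write: " + published);
    }
}
```

Without the batch, the same six updates would each cost their own write of 
the shared state.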



> Eliminate overseer queue 
> -------------------------
>
>                 Key: SOLR-5872
>                 URL: https://issues.apache.org/jira/browse/SOLR-5872
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>            Reporter: Noble Paul
>            Assignee: Noble Paul
>
> The overseer queue is one of the busiest points in the entire system. The 
> raison d'être of the queue is to:
>  * Provide batching of operations for the main clusterstate.json so that 
> state updates are minimized
>  * Avoid race conditions and ensure order
> Now, as we move the individual collection states out of the main 
> clusterstate.json, the batching is not useful anymore.
> Race conditions can easily be solved by using a compare and set in ZooKeeper. 
> The proposed solution is that, whenever an operation needs to be performed 
> on the clusterstate, the same thread (and of course the same JVM) will:
>  # read the fresh state and version of the zk node
>  # construct the new state
>  # perform a compare and set
>  # if the compare and set fails, go to step 1
> This should be limited to operations performed on external collections, 
> because batching would still be required for the others.
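
The four steps in the description above can be sketched as an optimistic 
retry loop. This is only an illustration under stated assumptions: 
VersionedNode is a hypothetical in-memory stand-in for a ZooKeeper node, and 
its compareAndSet imitates the versioned ZooKeeper.setData(path, data, 
version) call, which in the real client fails with 
KeeperException.BadVersionException on a version mismatch. CasUpdateSketch 
and update are likewise made-up names, not Solr code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

/** In-memory stand-in for a versioned ZooKeeper node (hypothetical). */
final class VersionedNode {
    /** Immutable view of the node: its data plus the version it was read at. */
    static final class Snapshot {
        final String data;
        final int version;
        Snapshot(String data, int version) {
            this.data = data;
            this.version = version;
        }
    }

    private String data;
    private int version = 0;

    VersionedNode(String initial) {
        this.data = initial;
    }

    synchronized Snapshot read() {
        return new Snapshot(data, version);
    }

    /** Writes only if the caller saw the latest version; false on a mismatch. */
    synchronized boolean compareAndSet(int expectedVersion, String newData) {
        if (expectedVersion != version) {
            return false;
        }
        data = newData;
        version++;
        return true;
    }
}

final class CasUpdateSketch {
    /** Read, construct, compare-and-set, retry. Returns the number of attempts. */
    static int update(VersionedNode node, UnaryOperator<String> change) {
        int attempts = 0;
        while (true) {
            attempts++;
            VersionedNode.Snapshot current = node.read();        // step 1
            String next = change.apply(current.data);            // step 2
            if (node.compareAndSet(current.version, next)) {     // step 3
                return attempts;
            }
            // step 4: another writer got in first, so start over
        }
    }

    public static void main(String[] args) throws InterruptedException {
        VersionedNode node = new VersionedNode("0");
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 1000; i++) {
            pool.execute(() -> {
                update(node, s -> Integer.toString(Integer.parseInt(s) + 1));
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        // Every increment lands exactly once, so this prints: 1000 @v1000
        System.out.println(node.read().data + " @v" + node.read().version);
    }
}
```

The loop never loses an update, but under contention each failed attempt 
costs an extra read and write, which is the "flood of attempts and retries" 
concern raised in the comment.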



