Hi folks (particularly Erick and Shalin),

Before I go through the cycle of creating JIRAs and requesting formal
review, I wondered if I could get some feedback on some work I've been
doing to allow SolrCloud to start up faster and more reliably.

Problems addressed and other changes:

1) Quickly restarting a node makes leader election unreliable; the existing
ZK node hasn't yet disappeared and confuses the current logic.  I believe I
have fixed this and simplified the logic.  This affects overseer election.

2) ZkController.publishAndWaitForDownStates() occurs before overseer
election.  That means if there is currently no overseer, there is
ironically no one to actually service the down state changes it's waiting
on.  This particularly affects a single-node cluster such as you might run
locally for development.

3) Audited our current implementations of process(WatchedEvent) for
consistency and handling edge cases.

4) Simplified DistributedMap; there's a whole lot more API surface area and
implementation machinery than we're using.
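
To make (2) concrete, here is a toy in-memory model of the ordering issue.
This is only a sketch and not the code in the pull request: ToyCluster and
all of its methods are hypothetical names that stand in for the overseer's
state-update queue, and electing the overseer before waiting is just one
assumed way to avoid the stall.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Toy model of the startup ordering issue; none of this is Solr code. */
class StartupOrderSketch {

    /** Hypothetical stand-in for the overseer and its state-update queue. */
    static final class ToyCluster {
        final Queue<String> pendingDownRequests = new ArrayDeque<>();
        final Set<String> coresMarkedDown = new HashSet<>();
        boolean overseerElected = false;

        void electOverseer() {
            overseerElected = true;
        }

        /** Enqueue a "mark this core down" request for the overseer. */
        void publishDown(String core) {
            pendingDownRequests.add(core);
        }

        /** Only an elected overseer drains the queue. */
        void overseerTick() {
            if (!overseerElected) {
                return;
            }
            while (!pendingDownRequests.isEmpty()) {
                coresMarkedDown.add(pendingDownRequests.poll());
            }
        }

        /** Polls up to maxTicks times; true once the core is marked down. */
        boolean waitForDown(String core, int maxTicks) {
            for (int i = 0; i < maxTicks; i++) {
                overseerTick();
                if (coresMarkedDown.contains(core)) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Publish and wait first, elect afterwards: nobody services the queue. */
    static boolean startPublishFirst() {
        ToyCluster cluster = new ToyCluster();
        cluster.publishDown("core1");
        boolean sawDown = cluster.waitForDown("core1", 10);
        cluster.electOverseer();
        return sawDown;
    }

    /** Elect first, then publish and wait: the request is serviced. */
    static boolean startElectFirst() {
        ToyCluster cluster = new ToyCluster();
        cluster.electOverseer();
        cluster.publishDown("core1");
        return cluster.waitForDown("core1", 10);
    }

    public static void main(String[] args) {
        System.out.println("publish first: " + startPublishFirst()); // false
        System.out.println("elect first:   " + startElectFirst());   // true
    }
}
```

In the single-node case the first ordering can only time out, since the
node that would become overseer is the same one sitting in the wait.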

Code is here: https://github.com/fullstorydev/lucene-solr/pull/1
The individual commits might be informative.

Would love some feedback, and if these seem reasonable I'll open one or
more JIRAs and rebase the changes to trunk.

Thanks!
Scott
