I seem to recall seeing this on my cluster when we didn't have clocks in sync, but perhaps my memory is fuzzy as well.
-Grant

On Aug 7, 2013, at 7:41 AM, Erick Erickson <[email protected]> wrote:

> Well, we're reconstructing a chain of _possibilities_ post-mortem,
> so there's not much I can say for sure. Mostly just throwing this
> out there in case it sparks some "aha" moments. Not knowing
> ZK well, anything I say is speculation.
>
> But I speculate that this isn't really the root of the problem given
> that we haven't been seeing the "ClusterState says we are the leader..."
> error go by the user lists for a while. It may well be a coincidence. The
> place that this happened reported that the problem "seemed to
> be better" after adjusting the ZK nodes' times. I know when I
> reconstruct events like this I'm never sure about cause and
> effect since I'm usually doing several things at once.
>
> Erick
>
>
> On Tue, Aug 6, 2013 at 5:51 PM, Chris Hostetter <[email protected]>
> wrote:
>
> : > When the times were coordinated, many of the problems with recovery went
> : > away. We're trying to reconstruct the scenario from memory, but it
> : > prompted me to pass the incident in case it sparked any thoughts.
> : > Specifically, I wonder if there's anything that comes to mind if the ZK
> : > nodes are significantly out of synch with each other time-wise.
> :
> : Does this mean that ntp or other strict time synchronization is important for
> : SolrCloud? I strive for this anyway, just to ensure that when I'm researching
> : log files between two machines that I can match things up properly.
>
> I don't know if/how Solr/ZK is affected by having machines with clocks out
> of sync, but i do remember seeing discussions a while back about weird
> things happening to ZK client apps *while* time adjustments are taking
> place to get back in sync.
>
> IIRC: as the local clock starts accelerating and jumping ahead in
> increments to "correct" itself with ntp, then those jumps can confuse the
> ZK code into thinking it's been waiting a lot longer than it really
> has for zk heartbeat (or whatever it's called) and it can trigger a
> timeout situation.
>
>
> -Hoss

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com
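
To make the clock-jump scenario Hoss describes a bit more concrete, here is a minimal sketch in Python. It is not ZooKeeper code and makes no claim about how ZooKeeper actually measures time. HeartbeatMonitor and FakeClocks are hypothetical names invented for this illustration. The only point is that a timeout computed from a wall clock can fire falsely when that clock is stepped forward, while the same check against a monotonic clock does not.

```python
class HeartbeatMonitor:
    """Hypothetical heartbeat timeout check driven by an injectable clock."""

    def __init__(self, timeout_s, clock):
        self.timeout_s = timeout_s
        self.clock = clock            # callable returning seconds as a float
        self.last_beat = clock()

    def beat(self):
        # Record that a heartbeat was just seen.
        self.last_beat = self.clock()

    def expired(self):
        # True if more than timeout_s appears to have passed since the last beat.
        return self.clock() - self.last_beat > self.timeout_s


class FakeClocks:
    """Simulated clocks so the example is deterministic (an assumption of this sketch)."""

    def __init__(self):
        self.elapsed = 0.0   # time that has really passed
        self.offset = 0.0    # accumulated wall-clock corrections

    def advance(self, seconds):
        self.elapsed += seconds

    def step_wall(self, seconds):
        # Models a time correction stepping the wall clock forward.
        self.offset += seconds

    def wall(self):
        return 1_000_000.0 + self.elapsed + self.offset

    def monotonic(self):
        return self.elapsed


clocks = FakeClocks()
wall_monitor = HeartbeatMonitor(10.0, clocks.wall)
mono_monitor = HeartbeatMonitor(10.0, clocks.monotonic)

clocks.advance(2.0)      # only 2 seconds really pass
clocks.step_wall(30.0)   # but the wall clock is stepped ahead by 30 seconds

wall_expired = wall_monitor.expired()   # apparent wait is 32 s, so this times out
mono_expired = mono_monitor.expired()   # real wait is 2 s, so this does not
```

Outside of a simulation, the two clocks would correspond to time.time (wall clock, subject to adjustment) and time.monotonic (never goes backwards and is unaffected by system clock updates).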
