I seem to recall seeing this on my cluster when we didn't have clocks in sync, 
but perhaps my memory is fuzzy as well.

-Grant

On Aug 7, 2013, at 7:41 AM, Erick Erickson <[email protected]> wrote:

> Well, we're reconstructing a chain of _possibilities_ post-mortem,
> so there's not much I can say for sure. Mostly just throwing this 
> out there in case it sparks some "aha" moments. Not knowing
> ZK well, anything I say is speculation.
> 
> But I speculate that this isn't really the root of the problem given
> that we haven't been seeing the "ClusterState says we are the leader..."
> error go by on the user lists for a while. It may well be a coincidence. The
> place that this happened reported that the problem "seemed to 
> be better" after adjusting the ZK nodes' times. I know when I
> reconstruct events like this I'm never sure about cause and
> effect since I'm usually doing several things at once.
> 
> Erick
> 
> 
> On Tue, Aug 6, 2013 at 5:51 PM, Chris Hostetter <[email protected]> wrote:
> 
> : > When the times were coordinated, many of the problems with recovery went
> : > away. We're trying to reconstruct the scenario from memory, but it
> : > prompted me to pass the incident along in case it sparked any thoughts.
> : > Specifically, I wonder if there's anything that comes to mind if the ZK
> : > nodes are significantly out of synch with each other time-wise.
> :
> : Does this mean that ntp or other strict time synchronization is important for
> : SolrCloud?  I strive for this anyway, just to ensure that when I'm researching
> : log files between two machines that I can match things up properly.
> 
> I don't know if/how Solr/ZK is affected by having machines with clocks out
> of sync, but I do remember seeing discussions a while back about weird
> things happening to ZK client apps *while* time adjustments are taking
> place to get back in sync.
> 
> IIRC: as the local clock starts accelerating and jumping ahead in
> increments to "correct" itself with ntp, those jumps can confuse the
> ZK code into thinking it's been waiting a lot longer than it really
> has for the ZK heartbeat (or whatever it's called), and that can trigger a
> timeout situation.
> 
> 
> -Hoss
> 

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com




