Re: cosd multi-second stalls cause "wrongly marked me down"

Gregory Farnum Wed, 23 Feb 2011 12:27:24 -0800

On Wednesday, February 23, 2011 at 11:23 AM, Jim Schutt wrote:
> > I have managed to get OSDs wrongly marking each other down during startup 
> > when they're peering large numbers of PGs/pools, as they disagree on who 
> > they need to be heartbeating (due to the slow handling of new osd maps and 
> > pg creates); if you're mostly seeing OSDs get incorrectly marked down 
> > during low epochs (your original email said epoch 7) this is probably what 
> > you're finding. 
> 
> What I've been trying to look for is heartbeat stalls after I 
> start up a bunch of clients writing. I'm really not sure why that
> original log caught one at such an early epoch - maybe there's
> two things going on?
> 
That wouldn't surprise me too much, but it's something to keep in mind when 
observing. :)


> > We still have no idea what could be causing the stall *inside* of tick(), 
> > though. :/
> 
> I think that one was just lucky. Most of the stalls I've
> collected are between ticks.
Stalls between ticks make a lot of sense, since tick() requires the osd_lock and 
we have some functions holding it for way too long. But as far as we can tell, a 
stalled tick() function shouldn't break anything: heartbeats are sent 
independently, and all the processing of heartbeats (where you detect down 
OSDs) is done inside of tick() in such a way that no received heartbeats get 
lost. So that shouldn't be a problem!
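
To make that argument concrete, here is a rough toy model in Python. It is 
not the cosd code: the class, the method names, and the separate 
heartbeat_lock are all made up for illustration, and only the names osd_lock 
and tick() come from the discussion above. It assumes heartbeat arrivals are 
recorded on a path that never takes osd_lock, so a late tick() still sees them.

```python
import threading


class HeartbeatTracker:
    """Toy model (hypothetical, not Ceph code): heartbeat arrivals are
    recorded under their own lock, separate from the lock tick() needs."""

    def __init__(self, grace):
        self.grace = grace                      # seconds of silence before a peer is suspect
        self.heartbeat_lock = threading.Lock()  # assumed dedicated heartbeat lock
        self.osd_lock = threading.Lock()        # stands in for the osd_lock
        self.last_rx = {}                       # peer name -> time of last heartbeat seen

    def on_heartbeat(self, peer, now):
        # Heartbeat path: never touches osd_lock, so it cannot be
        # blocked by whatever is delaying tick().
        with self.heartbeat_lock:
            self.last_rx[peer] = now

    def tick(self, now):
        # Tick path: may start late if osd_lock is held elsewhere, but
        # when it does run it evaluates the latest recorded arrivals.
        with self.osd_lock:
            with self.heartbeat_lock:
                snapshot = dict(self.last_rx)
        return sorted(p for p, t in snapshot.items() if now - t > self.grace)


tracker = HeartbeatTracker(grace=20)
tracker.on_heartbeat("osd.1", now=0)
tracker.on_heartbeat("osd.2", now=0)

# Simulate a long stall: something else holds osd_lock, so tick() cannot
# run, yet the heartbeat path keeps recording arrivals.
with tracker.osd_lock:
    tracker.on_heartbeat("osd.1", now=30)

# The delayed tick sees osd.1's recent heartbeat and flags only osd.2.
late = tracker.tick(now=35)
print(late)
```

In this model a stalled tick only delays the check; it never turns a live 
peer into a suspect one, which is why the stall by itself should not explain 
a "wrongly marked me down".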



