Re: lagging peering wq

Gregory Farnum Fri, 25 Jan 2013 10:01:43 -0800

On Friday, January 25, 2013 at 9:50 AM, Sage Weil wrote:
> Faidon/paravoid's cluster has a bunch of OSDs that are up, but the pg 
> queries indicate they are tens of thousands of epochs behind:
> 
> "history": { "epoch_created": 14,
> "last_epoch_started": 88174,
> "last_epoch_clean": 88174,
> "last_epoch_split": 0,
> "same_up_since": 88172,
> "same_interval_since": 88172,
> "same_primary_since": 88172,
> 
> (where the current map epoch is 102000 or thereabouts).
> 
> I think just restarting all OSDs at once will get him caught up (esp with 
> a 'ceph osd set noup' block until they are done processing maps), but I 
> wonder if we may want an additional check that if any PG falls more than X 
> epochs behind the OSD marks itself down and catches up before coming 
> in...
> 
> What do you think?
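
For concreteness, the check proposed above could be sketched roughly as
follows. This is only an illustration, not Ceph code: the function names,
the pgid strings, and the idea of feeding in one "last processed epoch"
per PG are all assumptions made for the example, and X is left as a
parameter because the thread does not suggest a value.

```python
def lagging_pgs(pg_epochs, current_epoch, max_lag):
    """Return {pgid: lag} for every PG more than max_lag epochs behind.

    pg_epochs is a hypothetical mapping of pgid -> last map epoch that
    PG has processed; it is not a real Ceph data structure.
    """
    lags = {}
    for pgid, epoch in pg_epochs.items():
        lag = current_epoch - epoch
        if lag > max_lag:
            lags[pgid] = lag
    return lags


def should_mark_down(pg_epochs, current_epoch, max_lag):
    """True if at least one PG has fallen more than max_lag epochs behind."""
    return bool(lagging_pgs(pg_epochs, current_epoch, max_lag))
```

With the numbers from the query above (a PG at epoch 88172 against a
current map epoch of about 102000), that PG is 13828 epochs behind, so
any threshold below that would trip the check.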


Sam's explained to me why this "shouldn't" happen (since events for each PG get 
queued on every map update), so it sounds like it would be better to prevent 
the mess (e.g., add some basic fairness to the PG work queue dispatchers in 
order to prevent any PG from falling so far behind), rather than trying to 
clean the mess up.
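
One possible reading of "basic fairness" is a round-robin dispatcher: each
PG keeps its own FIFO of events, and the dispatcher takes one event per PG
per turn, so a busy PG cannot starve the others. The sketch below only
illustrates that idea; the class and method names are hypothetical and it
is not how the OSD's work queues are actually implemented.

```python
from collections import OrderedDict, deque


class RoundRobinPGQueue:
    """Toy fair queue: at most one event per PG per dispatch turn."""

    def __init__(self):
        # pgid -> FIFO of pending events; dict order is the dispatch order.
        self._pending = OrderedDict()

    def enqueue(self, pgid, event):
        # A PG with no pending work joins the back of the rotation.
        self._pending.setdefault(pgid, deque()).append(event)

    def dequeue(self):
        """Return (pgid, event) for the next PG in rotation, or None if empty."""
        if not self._pending:
            return None
        pgid, events = self._pending.popitem(last=False)
        event = events.popleft()
        if events:
            # Still has work: move this PG to the back of the rotation.
            self._pending[pgid] = events
        return pgid, event

    def __len__(self):
        return sum(len(events) for events in self._pending.values())
```

If PG "A" has three events queued ahead of a single event for PG "B",
this dispatcher serves A, B, A, A, so B waits behind one event rather
than three.
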
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
