On Wed, 2 Mar 2011, Jim Schutt wrote:
> On Tue, 2011-03-01 at 17:53 -0700, Sage Weil wrote:
> > Hi Jim,
> > 
> > We've fixed a few different bugs over the last week that were causing 
> > heartbeat issues. 
> 
> Great!
> 
> >  Nothing that explains why we would see the hang that 
> > you did, but other problems that caused the same 'wrongly marked me down' 
> > issue.  Are you still seeing this problem with the latest 'next' and/or 
> > 'master' branch?
> 
> I've been trying to isolate this on the stable branch
> since my last posting - I can still reproduce at will
> with my 96 osd test, but I haven't made much progress
> at tracking down what is going wrong.
> 
> > 
> > Also, if you don't mind reproducing, can you post a larger segment of the 
> > log? 
> 
> Sure.  I've got some extra debug printing going in
> my tree - the most interesting is a patch to log
> queue, operation, and total elapsed times in
> dispatch_entry() - it makes it really easy to
> find when things go wrong.
>
> I'll try to reproduce with master and post logs.
> Is it OK for me to add my extra debug patches for
> that?  I'll post them with the logs if so.

Absolutely.
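
As a rough illustration of the kind of timing instrumentation described
above (queue, operation, and total elapsed times around a dispatch loop),
here is a minimal C++ sketch.  It is not the actual patch and does not use
Ceph's dispatch_entry() or its message types; TimedMessage, TimedQueue,
and dispatch_one() are hypothetical names made up for this example, and
the single-threaded queue is an assumption to keep it short.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <deque>
#include <functional>
#include <utility>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for a queued message; not a Ceph type.
struct TimedMessage {
    Clock::time_point enqueued;     // when the message entered the queue
    std::function<void()> handler;  // work done when it is dispatched
};

// Elapsed times for one dispatched message, in seconds.
struct DispatchTimes {
    double queue;      // time spent waiting in the queue
    double operation;  // time spent inside the handler
    double total;      // enqueue to handler completion
};

static double seconds(Clock::duration d) {
    return std::chrono::duration<double>(d).count();
}

// Hypothetical queue; a real dispatch loop would also need locking.
struct TimedQueue {
    std::deque<TimedMessage> q;

    void enqueue(std::function<void()> handler) {
        assert(handler);
        q.push_back({Clock::now(), std::move(handler)});
    }

    // Dispatch the oldest message and log where the time went.
    DispatchTimes dispatch_one() {
        assert(!q.empty());
        TimedMessage m = std::move(q.front());
        q.pop_front();

        Clock::time_point start = Clock::now();
        m.handler();
        Clock::time_point end = Clock::now();

        DispatchTimes t{seconds(start - m.enqueued),
                        seconds(end - start),
                        seconds(end - m.enqueued)};
        std::fprintf(stderr, "dispatch: queue %.6f op %.6f total %.6f\n",
                     t.queue, t.operation, t.total);
        return t;
    }
};
```

With per-message lines like these in the log, a large "queue" value points
at the dispatch thread being held up before the message is handled, while
a large "op" value points at the handler itself stalling.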

> >  The really interesting question is what the heartbeat thread 
> > (heartbeat_entry()) is doing during this period that tick() is blocked up, 
> > since that's the thread that's responsible for sending the ping messages 
> > to peer OSDs.
> 
> One of the things I am seeing is handle_osd_ping()
> getting stalled, but I haven't been able to track
> down why.
> 
> I'll see if I see the same signature with master,
> and post logs.

Thanks!  Keep us posted.
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
