[Putting list back on cc]
On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote:
> On 03/15/2013 04:23 PM, Greg Farnum wrote:
> > As I come back and look at these again, I'm not sure what the context
> > for these logs is. Which test did they come from, and which behavior
> > (slow or not slow, etc) did you see? :) -Greg
>
> They come from a test where I had debug mds = 20 and debug ms = 1
> on the MDS while writing files from 198 clients. It turns out that
> for some reason I need debug mds = 20 during writing to reproduce
> the slow stat behavior later.
>
> strace.find.dirs.txt.bz2 contains the log of running
> strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls -lhd {} \;
>
> From that output, I believe that the stat of at least these files is slow:
> zero0.rc11
> zero0.rc30
> zero0.rc46
> zero0.rc8
> zero0.tc103
> zero0.tc105
> zero0.tc106
> I believe that log shows slow stats on more files, but those are the first
> few.
>
> mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the
> find command started, until just after the fifth or sixth slow stat from
> the list above.
>
> I haven't yet tried to find other ways of reproducing this, but so far
> it appears that something happens during the writing of the files that
> ends up causing the condition that results in slow stat commands.
>
> I have the full MDS log from the writing of the files, as well, but it's
> big....
>
> Is that what you were after?
>
> Thanks for taking a look!
>
> -- Jim
I was just coming back to these to see what new information was available, but
I realized we'd discussed several tests and I wasn't sure which one these came
from. That information is enough, yes.
If you've only seen this with high-level MDS debugging enabled, I believe the
cause is as I mentioned last time: the MDS is flapping a bit, so some files get
marked as "needsrecover". They aren't being recovered asynchronously, so the
first thing that pokes them into doing a recover is the stat.
That's definitely not the behavior we want, so I'll be poking around the code a
bit and filing bugs. Given that explanation, though, it's a bit less scary than
random slow stats would be, so it's not such a high priority. :) Do let me know
if you come across it without the MDS and clients having had connection issues!
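
For picking the slow calls out of a trace like strace.find.dirs.txt, a rough
sketch along these lines might help. It is not anything from the Ceph tree: the
name slow_gaps and the 1-second default threshold are made up for illustration,
and it assumes each line starts with the HH:MM:SS.uuuuuu timestamp that
"strace -tt" writes. It only reports long pauses between consecutive traced
lines, so a flagged line is a candidate, not proof that that particular call
was the slow one.

```python
import re
from datetime import datetime

# Leading "HH:MM:SS.uuuuuu" timestamp as written by `strace -tt`.
TS = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s+(.*)$")


def slow_gaps(lines, threshold=1.0):
    """Return (gap_seconds, line_text) for each line followed by a long pause.

    `threshold` is a hypothetical cutoff in seconds; tune it to taste.
    """
    parsed = []
    for line in lines:
        m = TS.match(line)
        if m:  # skip anything that does not start with a timestamp
            stamp = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            parsed.append((stamp, m.group(2)))

    slow = []
    for (t0, text), (t1, _) in zip(parsed, parsed[1:]):
        gap = (t1 - t0).total_seconds()
        if gap < 0:  # the trace crossed midnight
            gap += 86400
        if gap >= threshold:
            slow.append((gap, text))
    return slow
```

Something like slow_gaps(open("strace.find.dirs.txt")) would then list the
lines that were followed by a pause of a second or more.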
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com