Hi Sam,
AweSAM! Man, this is really cool detective work!
>
> It looks like you're right Pete.  poll/epoll isn't to blame here.
> The poll calls are completed only a few microsecs after the request
> is received, even for the VFS cases where polls seems to be slower
> than with the admin tools.  It looks like the poll calls are
> "slowing" because the operations at the server are being delayed,
> causing them to just wait until another event occurs on the socket.
>
> So the real delay is somewhere on the server, between the first
> return of poll and the response being posted (11ms), whereas the VFS
> only takes 1.2ms to receive the response and send another request.

Another instance of a request scheduler stall on server perhaps?

> It boggles me how this behavior can take place with the VFS, and not
> with the admin tools.  How can an operation coming from the VFS take
> 11ms on the server, but for the same operation coming from the admin
> tool, it only takes 1ms?  I could believe that somehow our operations
> were getting gradually slower over time (which they appear to be),
> but why only with the VFS?

Drat.. THis is vexing indeed :(
Sam, I realize that you have ruled out client-side. But just staring
at these numbers
and experiments and the apparent difference in performance between VFS
and pvfs2 sysint, keeps pointing to pvfs2-client-core and interaction
with the device.
I cannot help but think that those layers are somehow slowing down things..

Sorry.. Totally out of ideas.:(
THanks,
Murali
>
> In any case, I think this explains the gradients we see in the
> graphs.  Each of the different visible contours is a different type
> of operation, and the operation is just taking a little bit longer
> each time.  Its easier to see in the poll plot below.  My guess is
> that the one at about 6ms is the getattr, then the two creates and
> the setattr at about 8ms, and finally, the contour at about 10ms is
> the crdirent.
>
> -sam
>
>
>
>
> On Oct 4, 2007, at 4:57 PM, Sam Lang wrote:
>
> >
> > On Oct 4, 2007, at 4:42 PM, Sam Lang wrote:
> >
> >>
> >> On Oct 4, 2007, at 3:45 PM, Pete Wyckoff wrote:
> >>
> >>> [EMAIL PROTECTED] wrote on Thu, 04 Oct 2007 15:21 -0500:
> >>>> I'm in the process of getting a trace and dump from the same server
> >>>> during the same runs.
> >>>
> >>> Good, will be interesting too see tcpdump + strace.
> >>
> >> Pete,
> >>
> >> Here's a trace and dump from the same set of runs.
> >>
> >> http://www.mcs.anl.gov/~slang/epoll-trace.out [370MB]
> >> http://www.mcs.anl.gov/~slang/server-packets.dump [180MB]
> >>
> >> Again, the trace isn't from strace, I'm tracing the epoll calls
> >> manually.
> >>
> >
> > Damnit, I forgot to recompile the servers.  These results are with
> > poll.  The slowdown still exists for these runs though.
> > -sam
> >
> >> -sam
> >>
> >>
> >>>   Although I
> >>> worry we might not learn anything new.
> >>>
> >>>> In the meantime, I attached two "zoomed" plots of the last one I
> >>>> sent, with ranges set to 5000-5200 secs and 5000-6000 secs,
> >>>> respectively.  I thought they were interesting, both for the
> >>>> behavior
> >>>> that epoll_wait exhibits during operations over the VFS, and
> >>>> just for
> >>>> the patterns they show at a scale of 100s of seconds (it almost
> >>>> looks
> >>>> like a seirpinski triangle).  Not that it sheds much light on the
> >>>> problem.
> >>>
> >>> Pretty.  Maybe there's a category for Most Attractive Unexplained
> >>> Distributed System Art in the state fair this year.
> >>>
> >>> Would the pending request ID stuff with slog etc be useful in
> >>> tracking individual operations?
> >>>
> >>>             -- Pete
> >>>
> >>
> >
>
>
>
>
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

Reply via email to