On Oct 3, 2007, at 10:26 AM, Sam Lang wrote:
Hi Phil,

Thanks for your response. It sounds like it's the same problem. It's nice to hear that it's reproducible (I'm not insane!), and that makes it that much more important to figure out. I mentioned in my email to Murali that we see the same problem with poll too.

We've taken the approach of trying to write a synthetic test that reproduces the problem, which would help us verify that it's poll/epoll related and not something else we're doing in PVFS. It would also give us something to show kernel developers. With the test I wrote, though, I haven't been able to reproduce the problem (sigh). I'm not sure it exactly matches the behavior of BMI tcp, so I'm going through the BMI code to verify that.

I've attached some plots of the poll and epoll timings (just the timings of those system calls at the server). I haven't added labels, so I'll just explain what they show. I ran a test that creates and then deletes 100 files through the VFS, then creates/deletes 100 files using the pvfs2-* tools. I do this 10 times, sleep for a while, and start over.

In the first plot (Iterations 1-10), the server and client daemon had just been restarted. You can tell which set of 100 operations is being timed by the epoll_ctl(ADD/DEL) calls. Even in the first 10 iterations (each creating/deleting 2000 files), the slowdown is visible. The other plots (Iterations 10-20, 20-28, and 28-32) show the same degradation. Those are the epoll plots.
Somehow I managed to forget to mention that the x-axis is seconds since the restart of the server and client, and the y-axis is the epoll time in seconds.
-sam
<Iterations 1-10.png> <Iterations 10-20.png> <Iterations 20-28.png> <Iterations 28-32.png>

With poll, I just created one plot, and there aren't add/del events to distinguish the admin tools from the VFS. I also didn't sleep as long between iterations. The poll timings on the pvfs2-client daemon socket are still getting visibly slower, though, as can be seen on the right of the graph, where the 100 create/delete operations through the admin tools are the thin lines starting from 0.

<poll.png>

For a while I was starting to think that somehow sockets weren't getting removed from the epoll set properly, even though we were doing epoll_ctl(...EPOLL_CTL_DEL...). The fact that the problem still shows up with poll, though, suggests that it's something else, since the fd array passed to poll isn't growing (I checked).

One thing I noticed: for an unexpected receive, we do a peek on the socket (recv(...MSG_PEEK)) and check whether a full BMI header is there. If it is, we set the socket to blocking mode, do a blocking recv for the header, and then set the socket back to non-blocking mode. Rob pointed out that we probably don't need to set the socket to blocking. Is there some reason we do that, instead of just doing the recv in non-blocking mode, since we've already checked that the bytes are there?

Any other ideas on what's going on here are more than welcome. :-)

-sam

On Oct 3, 2007, at 8:29 AM, Phil Carns wrote:

We have seen something that might be related, but we aren't sure. In our case, we had servers running 2.6.9 (with epoll enabled) and clients running an old 2.4 kernel (no epoll). We saw that after about a week of very heavy loading, the clients would get extremely slow for both metadata and I/O operations. We did not track it down to the level that you did, but it was clear that every operation was generically slower, rather than one particular operation causing the slowdown. Unmounting and remounting did not help (note that pvfs2-client does not close sockets on an unmount).
However, the pvfs2-* command line utilities were still running at full speed, and restarting either the pvfs2-client or pvfs2-server processes would clear up the problem. From this we assumed it must have been a problem with sockets that are kept open for extended periods. We suspected network errors to be a contributing factor, but were never able to determine that conclusively.

We no longer have this environment, however, so we can't really offer much more insight into what was going on. So far we haven't seen it pop up anywhere else.

-Phil

Murali Vilayannur wrote:

Hey Sam, ugh.. First off, really nice detective work!!! Just to make sure, can't we switch to a poll()-based server and see if ...

> ...degrades slowly with a long-lived (weeks and months) PVFS volume. The degradation is significant -- simple metadata operations are an order of magnitude slower after a month or so. The behavior turns out to only occur with the VFS and the pvfs2-client daemon: performance of the admin tools (pvfs2-touch, pvfs2-rm, etc.) against the same set of servers remains good. Restarting the client daemon also fixes the problem, suggesting that the long-lived open sockets are somehow the cause. The slowness also appears to be at the servers, not the clients: the same kernel module and client daemon pointed at a different filesystem and set of servers doesn't exhibit the performance degradation.
>
> Also, I should mention that the system config is a little different than usual. We have IO nodes mounting and unmounting the PVFS volume (and stopping the client daemon) with each user's job, which is fairly frequent, while on the login nodes the volume remains mounted for a long time (and that is where the performance degrades).
>
> Our hunch here is that epoll, or our use of epoll on the servers, is somehow to blame. Maybe the file descriptors opened on the server for pvfs2-client-core are getting pushed down further and further into the epoll set, which for some reason is growing with new connections coming and going.
> This might be the case if we were failing to remove sockets from the set on disconnect, for example. It doesn't look like that's happening, though, at least for normal disconnects.

we have the same problem..

> It's a PITA to debug, because the servers have to remain running for a long time (and the clients have to remain mounted) for the problem to be visible. Rob suggested I use strace on the servers to see what epoll was doing, and that showed some interesting results. Basically, it looks like epoll_wait takes significantly longer when clients are doing operations over the VFS rather than with the pvfs2 admin tools. Also, strace reported epoll_ctl(..., EPOLL_CTL_ADD, ...) getting called a few times, even for the VFS ops, and in those cases it's returning EEXIST.
>
> I noticed that we add a socket to the epoll set whenever we get a new connection, or whenever a read or write is posted (enqueue_operation), but we only remove the socket from the epoll set on errors or disconnects. So why are we adding it for reads and writes? Any connected socket should already be in the set, no? I think this may be why I'm seeing EEXIST with strace.

yep; Agreed; We shouldn't need to add it if it already exists. But that is not a bug as far as I can tell.

Yep; We should be checking for the return value from these functions. Perhaps make the _add and _remove inline functions with return values?

> Also, is it safe to check the error from epoll_ctl in BMI_socket_collection_[add|remove]?
>
> And finally, assuming PVFS is actually using the epoll calls properly, does anyone know of epoll bugs in a SUSE 2.6.5 kernel that would cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant to? Googling epoll and SUSE 2.6.5 isn't turning up anything...

Nope. none that I can think of..
thanks,
Murali

Thanks,
-sam

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
