Hey Sam,
Ugh.
First off, really nice detective work!!!

> degrades slowly with a long-lived (weeks and months) PVFS volume.
> The degradation is significant -- simple metadata operations are an
> order of magnitude slower after a month or so.  The behavior turns
> out to only occur with the VFS and pvfs2-client daemon:  performance
> of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the same set of
> servers remains good.  Restarting the client daemon also fixes the
> problem, suggesting that the long-lived open sockets are somehow the
> cause.  The slowness also appears to be at the servers not the
> clients: the same kernel module and client daemon to a different
> filesystem and set of servers doesn't exhibit the performance
> degradation.
>
> Also, I should mention that the system config is a little different
> than usual.  We have IO nodes mounting and unmounting the PVFS
> volume  (and stopping the client daemon) with each user's job, which
> is fairly frequent, while on the login nodes, the volume remains
> mounted for a long time (and where the performance degrades).
>
> Our hunch here is that epoll or our use of epoll on the servers is
> somehow to blame.  Maybe the file descriptors opened on the server
> for pvfs2-client-core are getting pushed down further and further
> into the epoll set, which for some reason is growing with new
> connections coming and going.  This might be the case if we were
> failing to remove sockets from the set on disconnect, for example.
> It doesn't look like that's happening though, at least for normal
> disconnects.

Just to make sure, could we switch to a poll()-based server and see if
we hit the same problem?


>
> It's a PITA to debug, because the servers have to remain running for a
> long time (and the clients have to remain mounted) for the problem to
> be visible.  Rob suggested I use strace on the servers to see what
> epoll was doing, and that showed some interesting results.
> Basically, it looks like epoll_wait takes significantly longer when
> clients are doing operations over the VFS, rather than with the pvfs2
> admin tools.  Also, strace reported epoll_ctl(...,
> EPOLL_CTL_ADD, ...) getting called a few times, even for the VFS
> ops, and in those cases it's returning EEXIST.
>
> I noticed that we add a socket to the epoll set whenever we get a new
> connection, or a read or write is posted (enqueue_operation), but we
> only remove the socket from the epoll set on errors or disconnects.
> So why are we adding it for reads and writes?  Any connected socket
> should already be in the set, no?  I think this may be why I'm seeing
> EEXIST with strace.

Yep, agreed; we shouldn't need to add it if it's already in the set.
But that's not a bug as far as I can tell.
>
> Also, is it safe to check the error from epoll_ctl in
> BMI_socket_collection_[add|remove]?

Yep, we should be checking the return values from those calls.
Perhaps make _add and _remove inline functions that return a value?

> And finally, assuming PVFS is actually using epoll calls properly,
> does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would
> cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant
> to?  Googling epoll and SUSE 2.6.5 isn't turning up anything...

Nope, none that I can think of.
thanks,
Murali
>
> Thanks,
> -sam
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
