Hey Sam,

Ugh.. First off, really nice detective work!!!

> degrades slowly with a long-lived (weeks and months) PVFS volume.
> The degradation is significant -- simple metadata operations are an
> order of magnitude slower after a month or so. The behavior turns
> out to only occur with the VFS and pvfs2-client daemon: performance
> of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the same set of
> servers remains good. Restarting the client daemon also fixes the
> problem, suggesting that the long-lived open sockets are somehow the
> cause. The slowness also appears to be at the servers, not the
> clients: the same kernel module and client daemon to a different
> filesystem and set of servers doesn't exhibit the performance
> degradation.
>
> Also, I should mention that the system config is a little different
> than usual. We have IO nodes mounting and unmounting the PVFS
> volume (and stopping the client daemon) with each user's job, which
> is fairly frequent, while on the login nodes, the volume remains
> mounted for a long time (and where the performance degrades).
>
> Our hunch here is that epoll or our use of epoll on the servers is
> somehow to blame. Maybe the file descriptors opened on the server
> for pvfs2-client-core are getting pushed down further and further
> into the epoll set, which for some reason is growing with new
> connections coming and going. This might be the case if we were
> failing to remove sockets from the set on disconnect, for example.
> It doesn't look like that's happening though, at least for normal
> disconnects.
Just to make sure, can't we switch to a poll()-based server and see if we hit the same problem?

> Its a PITA to debug, because the servers have to remain running for a
> long time (and the clients have to remain mounted) for the problem to
> be visible. Rob suggested I use strace on the servers to see what
> epoll was doing, and that showed some interesting results.
> Basically, it looks like epoll_wait takes significantly longer when
> clients are doing operations over the VFS, rather than with the pvfs2
> admin tools. Also, strace reported epoll_ctl(...,
> EPOLL_CTL_ADD, ...) getting called a few times, even for the VFS
> ops, and in those cases it's returning EEXIST.
>
> I noticed that we add a socket to the epoll set whenever we get a new
> connection, or a read or write is posted (enqueue_operation), but we
> only remove the socket from the epoll set on errors or disconnects.
> So why are we adding it for reads and writes? Any connected socket
> should already be in the set, no? I think this may be why I'm seeing
> EEXIST with strace.

Yep, agreed: we shouldn't need to add it if it's already in the set. But that is not a bug as far as I can tell.

> Also, is it safe to check the error from epoll_ctl in
> BMI_socket_collection_[add|remove]?

Yes, we should be checking the return value from these functions. Perhaps make _add and _remove inline functions with return values?

> And finally, assuming PVFS is actually using epoll calls properly,
> does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would
> cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant
> to? Googling epoll and SUSE 2.6.5 isn't turning up anything...

Nope, none that I can think of.
thanks,
Murali

> Thanks,
> -sam

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
