[Pvfs2-developers] epoll fun

Sam Lang Wed, 26 Sep 2007 16:01:48 -0700


Hi All,

I've been trying to debug a problem with PVFS, where performance degrades slowly with a long-lived (weeks and months) PVFS volume. The degradation is significant -- simple metadata operations are an order of magnitude slower after a month or so. The behavior turns out to only occur with the VFS and pvfs2-client daemon: performance of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the same set of servers remains good. Restarting the client daemon also fixes the problem, suggesting that the long-lived open sockets are somehow the cause. The slowness also appears to be at the servers, not the clients: the same kernel module and client daemon to a different filesystem and set of servers doesn't exhibit the performance degradation.

Also, I should mention that the system config is a little different than usual. We have IO nodes mounting and unmounting the PVFS volume (and stopping the client daemon) with each user's job, which is fairly frequent, while on the login nodes, the volume remains mounted for a long time (and that is where the performance degrades).

Our hunch here is that epoll or our use of epoll on the servers is somehow to blame. Maybe the file descriptors opened on the server for pvfs2-client-core are getting pushed down further and further into the epoll set, which for some reason is growing with new connections coming and going. This might be the case if we were failing to remove sockets from the set on disconnect, for example. It doesn't look like that's happening though, at least for normal disconnects.

It's a PITA to debug, because the servers have to remain running for a long time (and the clients have to remain mounted) for the problem to be visible. Rob suggested I use strace on the servers to see what epoll was doing, and that showed some interesting results. Basically, it looks like epoll_wait takes significantly longer when clients are doing operations over the VFS, rather than with the pvfs2 admin tools. Also, strace reported epoll_ctl(..., EPOLL_CTL_ADD, ...) getting called a few times, even for the VFS ops, and in those cases it's returning EEXIST.

I noticed that we add a socket to the epoll set whenever we get a new connection, or a read or write is posted (enqueue_operation), but we only remove the socket from the epoll set on errors or disconnects. So why are we adding it for reads and writes? Any connected socket should already be in the set, no? I think this may be why I'm seeing EEXIST with strace.
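The EEXIST behavior described above can be reproduced outside PVFS. The following is a minimal sketch, not PVFS code: the function name second_add_errno is made up for illustration, and a pipe stands in for a connected socket. It registers a descriptor with EPOLL_CTL_ADD, repeats the same call, and returns the errno of the second attempt.

```c
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (not part of PVFS/BMI): add the same fd to an epoll
 * set twice. Returns the errno of the second EPOLL_CTL_ADD, 0 if the second
 * add unexpectedly succeeded, or -1 if the setup itself failed. */
int second_add_errno(void)
{
    int fds[2];
    int result = -1;
    struct epoll_event ev;

    int epfd = epoll_create(16);  /* size is only a hint */
    if (epfd < 0)
        return -1;
    if (pipe(fds) < 0) {          /* pipe stands in for a connected socket */
        close(epfd);
        return -1;
    }

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fds[0];

    /* First add: what would happen when the connection is accepted. */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0) {
        /* Second add: what would happen when a read or write is posted. */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) < 0)
            result = errno;
        else
            result = 0;
    }

    close(fds[0]);
    close(fds[1]);
    close(epfd);
    return result;
}
```

On Linux the second add is expected to fail with EEXIST and leave the existing registration untouched, which would be consistent with the strace output if every posted read or write re-adds an already registered socket.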

Also, is it safe to check the error from epoll_ctl in BMI_socket_collection_[add|remove]?
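One way such error checking could look is sketched below. The names sc_add and sc_remove are hypothetical stand-ins, not the actual BMI_socket_collection_add/remove, whose real signatures are not shown here. The sketch assumes that a repeated add should just refresh the event mask, and it passes a non-NULL event to EPOLL_CTL_DEL because kernels before 2.6.9 require one even though the argument is ignored.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical wrapper: register fd, treating "already registered" as a
 * request to update the event mask. Returns 0 on success, -errno on error. */
int sc_add(int epfd, int fd, uint32_t events)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = events;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;
    if (errno == EEXIST) {
        /* fd is already in the set: modify instead of failing. */
        if (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) == 0)
            return 0;
    }
    return -errno;
}

/* Hypothetical wrapper: unregister fd. Returns 0 on success, -errno on
 * error; -ENOENT means the fd was not in the set. */
int sc_remove(int epfd, int fd)
{
    /* The event argument is ignored for EPOLL_CTL_DEL, but kernels before
     * 2.6.9 reject a NULL pointer, so always pass a dummy. */
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) == 0)
        return 0;
    return -errno;
}
```

With wrappers like these, a caller could log any return value other than 0 (and perhaps -ENOENT on remove) to see whether deletes are silently failing on the servers.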

And finally, assuming PVFS is actually using epoll calls properly, does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant to? Googling epoll and SUSE 2.6.5 isn't turning up anything...


Thanks,
-sam
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
