[Pvfs2-developers] the halloween bug fixed

Sam Lang Fri, 05 Oct 2007 08:51:19 -0700

The halloween bug (what I'm calling it -- it's been haunting us for a while now) is that we're adding address references to the bmi address list, and never removing them. In the prelude state machine, we make a BMI_set_info(addr, BMI_INC_ADDR_REF) call, which iterates through all the addresses in the reference list. It's this step that is causing the slowdown. As new connections are made, addr refs get added to the list and never removed, so the pvfs2-client-core addr ref ends up at the bottom of a very long list.
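To illustrate the cost, here is a minimal sketch of a reference list that only ever grows. It is not the BMI code: the struct and the functions ref_list_add, ref_list_find and ref_list_free are hypothetical, and inserting at the head is an assumption chosen to match the "bottom of a very long list" behavior described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an address reference; not the real BMI struct. */
struct addr_ref {
    int id;                  /* illustrative address identifier */
    struct addr_ref *next;
};

/* Push a new reference on the head of the list (assumed insertion order). */
struct addr_ref *ref_list_add(struct addr_ref *head, int id)
{
    struct addr_ref *r = malloc(sizeof(*r));
    if (r == NULL)
        return head;         /* allocation failed: list is unchanged */
    r->id = id;
    r->next = head;
    return r;
}

/* Linear search; *steps reports how many nodes were visited. */
struct addr_ref *ref_list_find(struct addr_ref *head, int id, size_t *steps)
{
    assert(steps != NULL);
    *steps = 0;
    for (struct addr_ref *r = head; r != NULL; r = r->next) {
        (*steps)++;
        if (r->id == id)
            return r;
    }
    return NULL;
}

/* Free every node in the list. */
void ref_list_free(struct addr_ref *head)
{
    while (head != NULL) {
        struct addr_ref *next = head->next;
        free(head);
        head = next;
    }
}
```

If the long-lived client is entry 0 and a thousand short-lived connections follow it without ever being removed, every lookup for entry 0 walks all of the newer entries first.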

The addr refs aren't getting removed, because in BMI_set_info(addr, BMI_DEC_ADDR_REF) -- called from final_response -- the code queries the bmi_tcp method on whether the address should be removed: BMI_tcp_get_info(BMI_DROP_ADDR_QUERY). This function always returns false (don't drop), unless there was a bmi error somewhere (ECANCEL is probably the only one that happens in practice -- due to a timeout).

Since our state actions block the main server thread, this caused degradation for all requests received during processing of requests from a long-lived socket. New connections hitting the server at different times would have been fine though, which is what I was seeing with my tests.

The obvious and easy fix is to have bmi-tcp return true from DROP_ADDR_QUERY for all address references. As far as I can tell, the only thing we save by keeping them around is a little memory allocation (the socket gets closed either way).
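The difference between the two policies can be sketched roughly as follows. This is a simplified model, not the BMI implementation: the types and the functions drop_only_on_error, drop_always, ref_release and simulate_clients are all hypothetical names, and the model assumes each connection holds exactly one reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-method address state; error_code != 0 means a past error. */
struct method_addr { int error_code; };

struct ref {
    struct method_addr maddr;
    int ref_count;
    struct ref *next;
};

/* A drop query answers: may this unused address be removed from the list? */
typedef bool (*drop_query_fn)(const struct method_addr *);

/* Old policy as described above: drop only if an error was recorded. */
bool drop_only_on_error(const struct method_addr *m) { return m->error_code != 0; }

/* Proposed policy: always allow the drop. */
bool drop_always(const struct method_addr *m) { (void)m; return true; }

/* Release one reference; unlink and free the entry if it is unused
 * and the drop query agrees. */
void ref_release(struct ref **head, struct ref *target, drop_query_fn should_drop)
{
    assert(head != NULL && target != NULL && should_drop != NULL);
    if (target->ref_count > 0)
        target->ref_count--;
    if (target->ref_count != 0 || !should_drop(&target->maddr))
        return;
    for (struct ref **pp = head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == target) {
            *pp = target->next;
            free(target);
            return;
        }
    }
}

/* Simulate n clients that each connect, take one reference, and release it.
 * Returns how many entries are still in the list afterwards. */
size_t simulate_clients(size_t n, drop_query_fn should_drop)
{
    struct ref *head = NULL;
    for (size_t i = 0; i < n; i++) {
        struct ref *r = calloc(1, sizeof(*r));   /* error_code starts at 0 */
        if (r == NULL)
            break;
        r->ref_count = 1;
        r->next = head;
        head = r;
        ref_release(&head, r, should_drop);
    }
    size_t remaining = 0;
    while (head != NULL) {
        struct ref *next = head->next;
        free(head);
        head = next;
        remaining++;
    }
    return remaining;
}
```

In this model, simulate_clients(100, drop_only_on_error) leaves all 100 entries behind because none of them recorded an error, while simulate_clients(100, drop_always) leaves none.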

In the changes I've been working on to get multiple address support in BMI, I've already replaced the linked list with a hash table, which wouldn't have made the problem go away, but the degradation wouldn't have been quite as bad (may have made it harder to find, actually). Maybe it's time to add some profiling info (perf stats?) to our basic list, queue and hash structures that would tell us how big they're getting.
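One possible shape for that kind of size tracking is a small counter struct embedded next to each container. This is only a sketch of the idea; list_stats, stats_on_add and stats_on_remove are hypothetical names and not part of any existing PVFS structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical size counters kept alongside a list, queue or hash table. */
struct list_stats {
    size_t len;       /* current number of entries */
    size_t max_len;   /* high-water mark since startup */
};

/* Call whenever an entry is inserted into the container. */
void stats_on_add(struct list_stats *s)
{
    assert(s != NULL);
    s->len++;
    if (s->len > s->max_len)
        s->max_len = s->len;
}

/* Call whenever an entry is removed from the container. */
void stats_on_remove(struct list_stats *s)
{
    assert(s != NULL);
    if (s->len > 0)
        s->len--;
}
```

A container whose len keeps climbing over days while the number of live connections stays flat would point at exactly this kind of leak.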

Anyway, thanks to all for contributing to the debugging process for this one.


-sam

On Sep 26, 2007, at 6:00 PM, Sam Lang wrote:

Hi All,
I've been trying to debug a problem with PVFS, where performance degrades slowly with a long-lived (weeks and months) PVFS volume. The degradation is significant -- simple metadata operations are an order of magnitude slower after a month or so. The behavior turns out to only occur with the VFS and pvfs2-client daemon: performance of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the same set of servers remains good. Restarting the client daemon also fixes the problem, suggesting that the long-lived open sockets are somehow the cause. The slowness also appears to be at the servers, not the clients: the same kernel module and client daemon to a different filesystem and set of servers doesn't exhibit the performance degradation.
Also, I should mention that the system config is a little different than usual. We have IO nodes mounting and unmounting the PVFS volume (and stopping the client daemon) with each user's job, which is fairly frequent, while on the login nodes, the volume remains mounted for a long time (and where the performance degrades).
Our hunch here is that epoll or our use of epoll on the servers is somehow to blame. Maybe the file descriptors opened on the server for pvfs2-client-core are getting pushed down further and further into the epoll set, which for some reason is growing with new connections coming and going. This might be the case if we were failing to remove sockets from the set on disconnect, for example. It doesn't look like that's happening though, at least for normal disconnects.
It's a PITA to debug, because the servers have to remain running for a long time (and the clients have to remain mounted) for the problem to be visible. Rob suggested I use strace on the servers to see what epoll was doing, and that showed some interesting results. Basically, it looks like epoll_wait takes significantly longer when clients are doing operations over the VFS, rather than with the pvfs2 admin tools. Also, strace reported epoll_ctl(..., EPOLL_CTL_ADD, ...) getting called a few times, even for the VFS ops, and in those cases it's returning EEXIST.
I noticed that we add a socket to the epoll set whenever we get a new connection, or a read or write is posted (enqueue_operation), but we only remove the socket from the epoll set on errors or disconnects. So why are we adding it for reads and writes? Any connected socket should already be in the set, no? I think this may be why I'm seeing EEXIST with strace.
Also, is it safe to check the error from epoll_ctl in BMI_socket_collection_[add|remove]?
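For reference, a minimal sketch of what checking that return value could look like on Linux, treating a duplicate registration as harmless. The helper name epoll_add_checked and its already_present out-parameter are hypothetical; only epoll_ctl, EPOLL_CTL_ADD, EPOLLIN and the EEXIST errno are real. It says nothing about what BMI_socket_collection_add actually does.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>

/* Add fd to the epoll set for read readiness.
 * A second add of the same fd fails with EEXIST; that case is reported
 * through *already_present and treated as success.
 * Returns 0 on success, -1 on any other error (errno is left set). */
int epoll_add_checked(int epfd, int fd, int *already_present)
{
    struct epoll_event ev;

    assert(already_present != NULL);
    *already_present = 0;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;
    if (errno == EEXIST) {
        *already_present = 1;
        return 0;
    }
    return -1;
}
```

Calling this twice with the same descriptor returns 0 both times, with already_present set to 1 on the second call, which matches the pattern seen in the strace output. Any other failure, such as a bad descriptor, is returned to the caller.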
And finally, assuming PVFS is actually using epoll calls properly, does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant to? Googling epoll and SUSE 2.6.5 isn't turning up anything...
Thanks,
-sam


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
