Re: [Pvfs2-developers] the halloween bug fixed

Rob Ross Fri, 05 Oct 2007 09:11:08 -0700

Well done Sam; thanks for tracking this one down. -- Rob


Sam Lang wrote:

The halloween bug (what I'm calling it -- it's been haunting us for a while now) is that we're adding address references to the bmi address list, and never removing them. In the prelude state machine, we make a BMI_set_info(addr, BMI_INC_ADDR_REF) call, which iterates through all the addresses in the reference list. It's this step that is causing the slowdown. As new connections are made, addr refs get added to the list and never removed, so the pvfs2-client-core addr ref ends up at the bottom of a very long list.
The addr refs aren't getting removed, because in BMI_set_info(addr, BMI_DEC_ADDR_REF) -- called from final_response -- the code queries the bmi_tcp method on whether the address should be removed: BMI_tcp_get_info(BMI_DROP_ADDR_QUERY). This function always returns false (don't drop), unless there was a bmi error somewhere (ECANCEL is probably the only one that happens in practice -- due to a timeout).
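To make the mechanism above concrete, here is a minimal sketch in C. It is not the PVFS code: the struct, the function names (ref_list_add, ref_inc, ref_dec, drop_query_old, drop_query_always) and the head-insertion order are all assumptions made for illustration. It only models the behavior described: a linear walk per lookup, and a drop query that says "keep" unless an error was seen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a BMI address reference (not the real struct). */
struct addr_ref {
    int addr;              /* opaque address id */
    int ref_count;
    int had_error;         /* nonzero if the transport reported an error */
    struct addr_ref *next;
};

struct ref_list {
    struct addr_ref *head;
    size_t len;
};

/* New references go to the head, so long-lived ones sink toward the tail. */
struct addr_ref *ref_list_add(struct ref_list *l, int addr)
{
    struct addr_ref *r = malloc(sizeof(*r));
    assert(r != NULL);
    r->addr = addr;
    r->ref_count = 0;
    r->had_error = 0;
    r->next = l->head;
    l->head = r;
    l->len++;
    return r;
}

/* Linear lookup; *steps receives the number of nodes visited. */
struct addr_ref *ref_list_find(struct ref_list *l, int addr, size_t *steps)
{
    size_t n = 0;
    struct addr_ref *r;
    for (r = l->head; r != NULL; r = r->next) {
        n++;
        if (r->addr == addr)
            break;
    }
    if (steps != NULL)
        *steps = n;
    return r;
}

/* Behavior described in the mail: drop only after an error. */
int drop_query_old(const struct addr_ref *r) { return r->had_error; }

/* Proposed behavior: always allow the drop. */
int drop_query_always(const struct addr_ref *r) { (void)r; return 1; }

void ref_inc(struct ref_list *l, int addr)
{
    struct addr_ref *r = ref_list_find(l, addr, NULL);
    if (r != NULL)
        r->ref_count++;
}

/* Decrement; unlink and free only if the count hits zero AND the query agrees. */
void ref_dec(struct ref_list *l, int addr,
             int (*drop_query)(const struct addr_ref *))
{
    struct addr_ref **pp = &l->head;
    while (*pp != NULL && (*pp)->addr != addr)
        pp = &(*pp)->next;
    if (*pp == NULL)
        return;
    struct addr_ref *r = *pp;
    assert(r->ref_count > 0);
    if (--r->ref_count == 0 && drop_query(r)) {
        *pp = r->next;
        free(r);
        l->len--;
    }
}
```

With drop_query_old, every short-lived connection leaves an entry behind, so finding the oldest address costs one step per connection ever made. With drop_query_always the list stays as short as the set of live references.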
Since our state actions block the main server thread, this caused degradation for all requests received during processing of requests from a long-lived socket. New connections hitting the server at different times would have been fine though, which is what I was seeing with my tests.
The obvious and easy fix is to have bmi-tcp return true from DROP_ADDR_QUERY for all address references. As far as I can tell, the only thing we save by keeping them around is a little memory allocation (the socket gets closed either way).
In the changes I've been working on to get multiple address support in BMI, I've already replaced the linked list with a hashtable, which wouldn't have made the problem go away, but the degradation wouldn't have been quite as bad (may have made it harder to find, actually). Maybe it's time to add some profiling info (perf stats?) to our basic list, queue and hash structures that would tell us how big they're getting.
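One way the size profiling idea could look, as a rough sketch only: a small counter struct embedded in each container and updated on every add and remove. The names (size_stat, size_stat_add, size_stat_remove) are hypothetical and not part of PVFS or its perf counter code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-container size statistics. */
struct size_stat {
    size_t current;     /* elements in the container right now */
    size_t high_water;  /* largest size ever observed */
    size_t adds;        /* total insertions */
    size_t removes;     /* total removals */
};

/* Call from the container's insert path. */
void size_stat_add(struct size_stat *s)
{
    s->adds++;
    s->current++;
    if (s->current > s->high_water)
        s->high_water = s->current;
}

/* Call from the container's remove path. */
void size_stat_remove(struct size_stat *s)
{
    assert(s->current > 0);
    s->removes++;
    s->current--;
}
```

A list whose adds count keeps climbing while removes stays near zero would have pointed at this bug directly.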
Anyway, thanks to all for contributing to the debugging process for this one.
-sam

On Sep 26, 2007, at 6:00 PM, Sam Lang wrote:
Hi All,
I've been trying to debug a problem with PVFS, where performance degrades slowly with a long-lived (weeks and months) PVFS volume. The degradation is significant -- simple metadata operations are an order of magnitude slower after a month or so. The behavior turns out to only occur with the VFS and pvfs2-client daemon: performance of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the same set of servers remains good. Restarting the client daemon also fixes the problem, suggesting that the long-lived open sockets are somehow the cause. The slowness also appears to be at the servers, not the clients: the same kernel module and client daemon to a different filesystem and set of servers doesn't exhibit the performance degradation.
Also, I should mention that the system config is a little different than usual. We have IO nodes mounting and unmounting the PVFS volume (and stopping the client daemon) with each user's job, which is fairly frequent, while on the login nodes, the volume remains mounted for a long time (and where the performance degrades).
Our hunch here is that epoll or our use of epoll on the servers is somehow to blame. Maybe the file descriptors opened on the server for pvfs2-client-core are getting pushed down further and further into the epoll set, which for some reason is growing with new connections coming and going. This might be the case if we were failing to remove sockets from the set on disconnect, for example. It doesn't look like that's happening though, at least for normal disconnects.
It's a PITA to debug, because the servers have to remain running for a long time (and the clients have to remain mounted) for the problem to be visible. Rob suggested I use strace on the servers to see what epoll was doing, and that showed some interesting results. Basically, it looks like epoll_wait takes significantly longer when clients are doing operations over the VFS, rather than with the pvfs2 admin tools. Also, strace reported epoll_ctl(..., EPOLL_CTL_ADD, ...) getting called a few times, even for the VFS ops, and in those cases it's returning EEXIST.
I noticed that we add a socket to the epoll set whenever we get a new connection, or a read or write is posted (enqueue_operation), but we only remove the socket from the epoll set on errors or disconnects. So why are we adding it for reads and writes? Any connected socket should already be in the set, no? I think this may be why I'm seeing EEXIST with strace.
Also, is it safe to check the error from epoll_ctl in BMI_socket_collection_[add|remove]?
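For reference, a sketch of what checking the epoll_ctl result could look like on Linux. This is not the BMI_socket_collection code; sc_add and sc_remove are hypothetical helpers, and the choice to treat EEXIST on add and ENOENT on remove as benign is an assumption based on the redundant adds described above.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>

/* Hypothetical helper: returns 0 if added, 1 if the fd was already in the
 * set (EEXIST), or a negative errno value on any other failure. */
int sc_add(int epfd, int fd, unsigned int events)
{
    struct epoll_event ev;

    assert(epfd >= 0 && fd >= 0);
    memset(&ev, 0, sizeof(ev));
    ev.events = events;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;
    if (errno == EEXIST)
        return 1;           /* redundant add, e.g. from a posted read/write */
    return -errno;
}

/* Hypothetical helper: returns 0 if removed, 1 if the fd was not in the
 * set (ENOENT), or a negative errno value on any other failure. */
int sc_remove(int epfd, int fd)
{
    /* Kernels before 2.6.9 require a non-NULL event pointer for DEL. */
    struct epoll_event ev;

    assert(epfd >= 0 && fd >= 0);
    memset(&ev, 0, sizeof(ev));

    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) == 0)
        return 0;
    if (errno == ENOENT)
        return 1;           /* already gone */
    return -errno;
}
```

Returning distinct values lets the caller log unexpected failures while ignoring the duplicate-add case.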
And finally, assuming PVFS is actually using epoll calls properly, does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant to? Googling epoll and SUSE 2.6.5 isn't turning up anything...
Thanks,
-sam
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

