Awesome! Great job, Sam! Perhaps you could also crossport the hash table fixes along with the fix for this? Thanks! Murali
On 10/5/07, Sam Lang <[EMAIL PROTECTED]> wrote:
>
> The halloween bug (what I'm calling it -- it's been haunting us for a
> while now) is that we're adding address references to the BMI address
> list and never removing them. In the prelude state machine, we make a
> BMI_set_info(addr, BMI_INC_ADDR_REF) call, which iterates through all
> the addresses in the reference list. It's this step that is causing
> the slowdown. As new connections are made, addr refs get added to the
> list and never removed, so the pvfs2-client-core addr ref ends up at
> the bottom of a very long list.
>
> The addr refs aren't getting removed because in BMI_set_info(addr,
> BMI_DEC_ADDR_REF) -- called from final_response -- the code queries
> the bmi_tcp method on whether the address should be removed
> (BMI_tcp_get_info(BMI_DROP_ADDR_QUERY)). This function always returns
> false (don't drop), unless there was a BMI error somewhere (ECANCEL
> is probably the only one that happens in practice -- due to a
> timeout).
>
> Since our state actions block the main server thread, this caused
> degradation for all requests received during processing of requests
> from a long-lived socket. New connections hitting the server at
> different times would have been fine, though, which is what I was
> seeing with my tests.
>
> The obvious and easy fix is to have bmi-tcp return true from
> DROP_ADDR_QUERY for all address references. As far as I can tell, the
> only thing we save by keeping them around is a little memory
> allocation (the socket gets closed either way).
>
> In the changes I've been working on to get multiple address support
> in BMI, I've already replaced the linked list with a hashtable. That
> wouldn't have made the problem go away, but the degradation wouldn't
> have been quite as bad (it may have made it harder to find,
> actually). Maybe it's time to add some profiling info (perf stats?)
> to our basic list, queue, and hash structures that would tell us how
> big they're getting.
>
> Anyway, thanks to all for contributing to the debugging process for
> this one.
>
> -sam
>
> On Sep 26, 2007, at 6:00 PM, Sam Lang wrote:
>
> > Hi All,
> >
> > I've been trying to debug a problem with PVFS where performance
> > degrades slowly with a long-lived (weeks and months) PVFS volume.
> > The degradation is significant -- simple metadata operations are an
> > order of magnitude slower after a month or so. The behavior turns
> > out to only occur with the VFS and pvfs2-client daemon: performance
> > of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the same set of
> > servers remains good. Restarting the client daemon also fixes the
> > problem, suggesting that the long-lived open sockets are somehow
> > the cause. The slowness also appears to be at the servers, not the
> > clients: the same kernel module and client daemon to a different
> > filesystem and set of servers doesn't exhibit the performance
> > degradation.
> >
> > Also, I should mention that the system config is a little different
> > than usual. We have IO nodes mounting and unmounting the PVFS
> > volume (and stopping the client daemon) with each user's job, which
> > is fairly frequent, while on the login nodes, the volume remains
> > mounted for a long time (and where the performance degrades).
> >
> > Our hunch here is that epoll, or our use of epoll on the servers,
> > is somehow to blame. Maybe the file descriptors opened on the
> > server for pvfs2-client-core are getting pushed down further and
> > further into the epoll set, which for some reason is growing with
> > new connections coming and going. This might be the case if we were
> > failing to remove sockets from the set on disconnect, for example.
> > It doesn't look like that's happening though, at least for normal
> > disconnects.
> >
> > It's a PITA to debug, because the servers have to remain running
> > for a long time (and the clients have to remain mounted) for the
> > problem to be visible.
> > Rob suggested I use strace on the servers to see what epoll was
> > doing, and that showed some interesting results. Basically, it
> > looks like epoll_wait takes significantly longer when clients are
> > doing operations over the VFS rather than with the pvfs2 admin
> > tools. Also, strace reported epoll_ctl(..., EPOLL_CTL_ADD, ...)
> > getting called a few times, even for the VFS ops, and in those
> > cases it's returning EEXIST.
> >
> > I noticed that we add a socket to the epoll set whenever we get a
> > new connection, or a read or write is posted (enqueue_operation),
> > but we only remove the socket from the epoll set on errors or
> > disconnects. So why are we adding it for reads and writes? Any
> > connected socket should already be in the set, no? I think this may
> > be why I'm seeing EEXIST with strace.
> >
> > Also, is it safe to check the error from epoll_ctl in
> > BMI_socket_collection_[add|remove]?
> >
> > And finally, assuming PVFS is actually using epoll calls properly,
> > does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would
> > cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant
> > to? Googling epoll and SUSE 2.6.5 isn't turning up anything...
> >
> > Thanks,
> > -sam
>
> _______________________________________________
> Pvfs2-developers mailing list
> [EMAIL PROTECTED]
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
