The Halloween bug (what I'm calling it -- it's been haunting us for a
while now) is that we're adding address references to the BMI address
list and never removing them. In the prelude state machine, we make
a BMI_set_info(addr, BMI_INC_ADDR_REF) call, which iterates through
all the addresses in the reference list. It's this step that is
causing the slowdown. As new connections are made, addr refs get
added to the list and never removed, so the pvfs2-client-core addr
ref ends up at the bottom of a very long list.
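To make the cost concrete, here's a minimal sketch of the pattern
(illustrative names only, not the actual BMI internals): every
ref-count bump has to walk the whole reference list, and that list
only ever grows.

    #include <stddef.h>

    /* Illustrative sketch, not the real BMI code: the point is that
     * BMI_set_info(addr, BMI_INC_ADDR_REF) has to scan the whole
     * reference list to find the matching entry, so the cost grows
     * with every connection that was ever added and never dropped. */
    struct ref_entry {
        void *addr;                 /* method address */
        int ref_count;
        struct ref_entry *next;
    };

    static struct ref_entry *ref_list = NULL;   /* only ever grows */

    static int inc_addr_ref(void *addr)
    {
        struct ref_entry *r;

        /* O(n) walk, where n is every address ref ever added */
        for (r = ref_list; r != NULL; r = r->next) {
            if (r->addr == addr) {
                r->ref_count++;
                return 0;
            }
        }
        return -1;   /* not found */
    }
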
The addr refs aren't getting removed because in BMI_set_info(addr,
BMI_DEC_ADDR_REF) -- called from final_response -- the code queries
the bmi_tcp method, via BMI_tcp_get_info(BMI_DROP_ADDR_QUERY), about
whether the address should be removed. That query always returns
false (don't drop) unless there was a BMI error somewhere (ECANCEL
is probably the only one that happens in practice, due to a timeout).
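In other words (again with made-up names, just to show the shape of
it), the method's answer to the drop query looks roughly like this
today:

    /* Sketch of the current behavior: bmi_tcp only agrees to drop an
     * address ref after a BMI error, so on the normal path nothing
     * is ever removed from the reference list. */
    static int tcp_drop_addr_query(int had_bmi_error)
    {
        return had_bmi_error ? 1 : 0;   /* drop only after an error */
    }
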
Since our state actions block the main server thread, this caused
degradation for all requests received while requests from a
long-lived socket were being processed. New connections hitting the
server at different times would have been fine, though, which is
what I was seeing in my tests.
The obvious and easy fix is to have bmi-tcp return true from
DROP_ADDR_QUERY for all address references. As far as I can tell,
the only thing we save by keeping them around is a little memory
allocation (the socket gets closed either way).
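Roughly (same made-up helper as in the sketch above), the change I
have in mind is just:

    /* Proposed behavior: always let the generic layer reclaim the
     * entry once its ref count drops to zero. */
    static int tcp_drop_addr_query(int had_bmi_error)
    {
        (void) had_bmi_error;
        return 1;   /* always ok to drop */
    }
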
In the changes I've been working on to get multiple address support
in BMI, I've already replaced the linked list with a hashtable. That
wouldn't have made the problem go away, but the degradation wouldn't
have been quite as bad (which may have made the bug harder to find,
actually).
Maybe it's time to add some profiling info (perf stats?) to our basic
list, queue, and hash structures that would tell us how big they're
getting.
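Something as simple as a current-size/high-water counter per
structure would have pointed straight at this. A rough sketch of
what I mean (names made up):

    /* Hypothetical instrumentation for our list/queue/hash code:
     * track the current size and the high-water mark, and dump them
     * alongside the other perf counters. */
    struct dstruct_stats {
        unsigned long current;      /* entries right now */
        unsigned long high_water;   /* largest size ever seen */
    };

    static inline void dstruct_stats_add(struct dstruct_stats *s)
    {
        if (++s->current > s->high_water)
            s->high_water = s->current;
    }

    static inline void dstruct_stats_del(struct dstruct_stats *s)
    {
        if (s->current > 0)
            s->current--;
    }
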
Anyway, thanks to all for contributing to the debugging process for
this one.
-sam
On Sep 26, 2007, at 6:00 PM, Sam Lang wrote:
Hi All,
I've been trying to debug a problem with PVFS, where performance
degrades slowly with a long-lived (weeks and months) PVFS volume.
The degradation is significant -- simple metadata operations are an
order of magnitude slower after a month or so. The behavior turns
out to only occur with the VFS and pvfs2-client daemon:
performance of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the
same set of servers remains good. Restarting the client daemon
also fixes the problem, suggesting that the long-lived open sockets
are somehow the cause. The slowness also appears to be at the
servers, not the clients: the same kernel module and client daemon
pointed at a different filesystem and set of servers doesn't exhibit
the performance degradation.
Also, I should mention that the system config is a little different
than usual. We have IO nodes mounting and unmounting the PVFS
volume (and stopping the client daemon) with each user's job, which
is fairly frequent, while on the login nodes the volume remains
mounted for a long time (and that's where the performance degrades).
Our hunch here is that epoll or our use of epoll on the servers is
somehow to blame. Maybe the file descriptors opened on the server
for pvfs2-client-core are getting pushed down further and further
into the epoll set, which for some reason is growing with new
connections coming and going. This might be the case if we were
failing to remove sockets from the set on disconnect, for example.
It doesn't look like that's happening though, at least for normal
disconnects.
It's a PITA to debug, because the servers have to remain running for
a long time (and the clients have to remain mounted) for the
problem to be visible. Rob suggested I use strace on the servers
to see what epoll was doing, and that showed some interesting
results. Basically, it looks like epoll_wait takes significantly
longer when clients are doing operations over the VFS, rather than
with the pvfs2 admin tools. Also, strace reported epoll_ctl(...,
EPOLL_CTL_ADD, ...) getting called a few times, even for the VFS
ops, and in those cases it's returning EEXIST.
I noticed that we add a socket to the epoll set whenever we get a
new connection, or a read or write is posted (enqueue_operation),
but we only remove the socket from the epoll set on errors or
disconnects. So why are we adding it for reads and writes? Any
connected socket should already be in the set, no? I think this
may be why I'm seeing EEXIST with strace.
Also, is it safe to check the error from epoll_ctl in
BMI_socket_collection_[add|remove]?
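For what it's worth, here's the sort of thing I'm imagining for the
add path (a sketch, not what's in the tree): check epoll_ctl()'s
return value and treat EEXIST as "already registered", falling back
to EPOLL_CTL_MOD when we only want to change the event mask.

    #include <errno.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/epoll.h>

    /* Hypothetical helper: add fd to the epoll set, tolerating the
     * case where it is already there. */
    static int sc_add_fd(int epfd, int fd, uint32_t events)
    {
        struct epoll_event ev;

        memset(&ev, 0, sizeof(ev));
        ev.events = events;
        ev.data.fd = fd;

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
            if (errno == EEXIST)
                /* already in the set: just update the event mask */
                return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
            return -1;
        }
        return 0;
    }
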
And finally, assuming PVFS is actually using epoll calls properly,
does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would
cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant
to? Googling epoll and SUSE 2.6.5 isn't turning up anything...
Thanks,
-sam