On Oct 5, 2007, at 12:14 PM, Sam Lang wrote:
On Oct 5, 2007, at 10:49 AM, Sam Lang wrote:
The obvious and easy fix is to have bmi-tcp return true from
DROP_ADDR_QUERY for all address references. As far as I can tell,
the only thing we save by keeping them around is a little memory
allocation (the socket gets closed either way).
This suggested fix isn't right. The DEC_ADDR_REF that decrements
the refcount to zero is invoked after sending the final response,
but that's usually before the client (in the case of the admin
tools) closes the connection. It looks like it's
tcp_forget_addr in the bmi method that needs to call back out to
the bmi wrapper layer to remove the reference from the list. I can
call BMI_set_info(addr, BMI_TCP_CLOSE_SOCKET) from tcp_forget_addr,
but that seems a bit backwards...
Actually it looks like we just need a companion function for
bmi_method_addr_reg_callback.
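
To make the idea concrete, here is a rough sketch of what such a companion might look like. It is not the actual BMI code: every name in it (method_addr, wrapper_addr_reg, wrapper_addr_unreg, method_forget_addr) is hypothetical, the reference list is a plain singly linked list, and "closing the socket" is reduced to setting a field. The only point is the direction of the call: the method's forget path calls out to the wrapper layer so the reference is dropped from the list.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a method-level address. */
struct method_addr {
    int sock;
};

/* Hypothetical wrapper-layer reference list entry. */
struct ref_entry {
    struct ref_entry *next;
    struct method_addr *addr;
};

static struct ref_entry *ref_list = NULL;
static size_t ref_list_len = 0;

/* Plays the role of a registration callback: the wrapper layer
 * starts tracking an address that the method discovered. */
static int wrapper_addr_reg(struct method_addr *addr)
{
    struct ref_entry *e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->addr = addr;
    e->next = ref_list;
    ref_list = e;
    ref_list_len++;
    return 0;
}

/* Hypothetical companion: stop tracking an address.
 * Returns 0 if the address was found and removed, -1 otherwise. */
static int wrapper_addr_unreg(struct method_addr *addr)
{
    struct ref_entry **pp;
    for (pp = &ref_list; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->addr == addr) {
            struct ref_entry *dead = *pp;
            *pp = dead->next;
            free(dead);
            ref_list_len--;
            return 0;
        }
    }
    return -1;
}

/* Method-side forget path: tear down the connection state, then
 * tell the wrapper layer to drop its reference. */
static int method_forget_addr(struct method_addr *addr)
{
    addr->sock = -1; /* stand-in for closing the socket */
    return wrapper_addr_unreg(addr);
}
```

With something along these lines, the list only holds addresses the method still knows about, so it can't keep growing as clients connect and disconnect.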
-sam
-sam
In the changes I've been working on to get multiple address
support in BMI, I've already replaced the linked list with a
hashtable, which wouldn't have made the problem go away, but the
degradation wouldn't have been quite as bad (may have made it
harder to find, actually). Maybe it's time to add some profiling
info (perf stats?) to our basic list, queue and hash structures
that would tell us how big they're getting.
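
As a minimal sketch of the kind of instrumentation meant here (not the existing PVFS list code; the stat_list type and its functions are made up for illustration), a list can carry its current length and a high-water mark that are updated on every push and pop:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical instrumented list node. */
struct stat_node {
    struct stat_node *next;
    void *item;
};

/* Hypothetical list head carrying two cheap counters. */
struct stat_list {
    struct stat_node *head;
    size_t len;        /* entries currently in the list */
    size_t high_water; /* largest len ever observed */
};

/* Push an item at the head and update the counters.
 * Returns 0 on success, -1 if allocation fails. */
static int stat_list_push(struct stat_list *l, void *item)
{
    struct stat_node *n = malloc(sizeof(*n));
    if (n == NULL)
        return -1;
    n->item = item;
    n->next = l->head;
    l->head = n;
    l->len++;
    if (l->len > l->high_water)
        l->high_water = l->len;
    return 0;
}

/* Pop the head item, or return NULL if the list is empty. */
static void *stat_list_pop(struct stat_list *l)
{
    struct stat_node *n = l->head;
    void *item;
    if (n == NULL)
        return NULL;
    item = n->item;
    l->head = n->next;
    free(n);
    l->len--;
    return item;
}
```

Reporting len and high_water periodically (or on request) would have shown the address list growing without bound long before the slowdown became obvious.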
Anyway, thanks to all for contributing to the debugging process
for this one.
-sam
On Sep 26, 2007, at 6:00 PM, Sam Lang wrote:
Hi All,
I've been trying to debug a problem with PVFS, where performance
degrades slowly with a long-lived (weeks and months) PVFS
volume. The degradation is significant -- simple metadata
operations are an order of magnitude slower after a month or so.
The behavior turns out to occur only with the VFS and
pvfs2-client daemon: performance of the admin tools (pvfs2-touch,
pvfs2-rm, etc.) to the same set of servers remains good.
Restarting the client daemon also fixes the problem, suggesting
that the long-lived open sockets are somehow the cause. The
slowness also appears to be at the servers, not the clients: the
same kernel module and client daemon talking to a different filesystem
and set of servers don't exhibit the performance degradation.
Also, I should mention that the system config is a little
different than usual. We have IO nodes mounting and unmounting
the PVFS volume (and stopping the client daemon) with each
user's job, which is fairly frequent, while on the login nodes
the volume remains mounted for a long time (and that is where the
performance degrades).
Our hunch here is that epoll or our use of epoll on the servers
is somehow to blame. Maybe the file descriptors opened on the
server for pvfs2-client-core are getting pushed down further and
further into the epoll set, which for some reason is growing with
new connections coming and going. This might be the case if we
were failing to remove sockets from the set on disconnect, for
example. It doesn't look like that's happening though, at least
for normal disconnects.
It's a PITA to debug, because the servers have to remain running
for a long time (and the clients have to remain mounted) for the
problem to be visible. Rob suggested I use strace on the servers
to see what epoll was doing, and that showed some interesting
results. Basically, it looks like epoll_wait takes significantly
longer when clients are doing operations over the VFS, rather
than with the pvfs2 admin tools. Also, strace reported
epoll_ctl(..., EPOLL_CTL_ADD, ...) getting called a few times, even for
the VFS ops, and in those cases it's returning EEXIST.
I noticed that we add a socket to the epoll set whenever we get a
new connection, or a read or write is posted (enqueue_operation),
but we only remove the socket from the epoll set on errors or
disconnects. So why are we adding it for reads and writes? Any
connected socket should already be in the set, no? I think this
may be why I'm seeing EEXIST with strace.
Also, is it safe to check the error from epoll_ctl in
BMI_socket_collection_[add|remove]?
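
For what it's worth, here is a rough sketch of what checked add/remove could look like. This is not the actual BMI_socket_collection code: sc_add_checked and sc_remove_checked are hypothetical helpers, and treating EEXIST on add and ENOENT on remove as harmless is an assumption about what we'd want, based on the repeated adds described above.

```c
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>

/* Hypothetical helper: register fd for read events.
 * An fd that is already in the set (EEXIST) is treated as success.
 * Returns 0 on success, -errno on any other failure. */
static int sc_add_checked(int epfd, int fd)
{
    struct epoll_event ev;
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;
    if (errno == EEXIST)
        return 0; /* already registered, nothing to do */
    return -errno;
}

/* Hypothetical helper: remove fd from the set.
 * An fd that is not in the set (ENOENT) is treated as success.
 * Returns 0 on success, -errno on any other failure. */
static int sc_remove_checked(int epfd, int fd)
{
    /* Kernels before 2.6.9 require a non-NULL event pointer for
     * EPOLL_CTL_DEL even though the contents are ignored. */
    struct epoll_event ev;
    memset(&ev, 0, sizeof(ev));

    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) == 0)
        return 0;
    if (errno == ENOENT)
        return 0; /* was not registered, nothing to do */
    return -errno;
}
```

Anything other than those two benign cases (EBADF, for example) would then show up as a real error instead of being silently dropped.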
And finally, assuming PVFS is actually using epoll calls
properly, does anyone know of epoll bugs on a SUSE 2.6.5 kernel
that would cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do
what it's meant to? Googling epoll and SUSE 2.6.5 isn't turning
up anything...
Thanks,
-sam