Awesome! Great job, Sam!
Perhaps you could also cross-port the hash table fixes along with
the fix for this?
thanks!
Murali

On 10/5/07, Sam Lang <[EMAIL PROTECTED]> wrote:
>
> The halloween bug (what I'm calling it -- it's been haunting us for a
> while now) is that we're adding address references to the BMI address
> list and never removing them.  In the prelude state machine, we make
> a BMI_set_info(addr, BMI_INC_ADDR_REF) call, which iterates through
> all the addresses in the reference list.  It's this step that is
> causing the slowdown.  As new connections are made, addr refs get
> added to the list and never removed, so the pvfs2-client-core addr
> ref ends up at the bottom of a very long list.
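A minimal sketch of that pattern, to make the cost concrete -- the names (addr_ref, ref_list_*) are illustrative, not the real PVFS code:

```c
#include <stdlib.h>

/* A singly linked list of address references that every
 * BMI_INC_ADDR_REF-style call must walk from the head. */
struct addr_ref {
    unsigned long addr;      /* BMI address handle */
    int ref_count;           /* bumped on INC, dropped on DEC */
    struct addr_ref *next;
};

/* Prepend a new reference.  If refs are never removed, older entries
 * sink toward the tail and the list grows without bound as
 * connections come and go. */
struct addr_ref *ref_list_add(struct addr_ref *head, unsigned long addr)
{
    struct addr_ref *r = malloc(sizeof(*r));
    r->addr = addr;
    r->ref_count = 1;
    r->next = head;
    return r;
}

/* O(n) scan on every increment: this is the per-operation cost that
 * compounds as the list grows, so a long-lived entry near the tail
 * (like pvfs2-client-core's) gets more expensive to reach over time. */
int ref_list_inc(struct addr_ref *head, unsigned long addr)
{
    for (struct addr_ref *r = head; r; r = r->next) {
        if (r->addr == addr)
            return ++r->ref_count;
    }
    return -1; /* not found */
}
```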
>
> The addr refs aren't getting removed because in BMI_set_info(addr,
> BMI_DEC_ADDR_REF) -- called from final_response -- the code queries
> the bmi_tcp method on whether the address should be removed, via
> BMI_tcp_get_info(BMI_DROP_ADDR_QUERY).  This function always returns
> false (don't drop), unless there was a BMI error somewhere (ECANCEL
> is probably the only one that happens in practice -- due to a timeout).
>
> Since our state actions block the main server thread, this caused
> degradation for all requests received during processing of requests
> from a long-lived socket.  New connections hitting the server at
> different times would have been fine though, which is what I was
> seeing with my tests.
>
> The obvious and easy fix is to have bmi-tcp return true from
> DROP_ADDR_QUERY for all address references.  As far as I can tell,
> the only thing we save by keeping them around is a little memory
> allocation (the socket gets closed either way).
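The before/after behavior of that query could be sketched like this -- the function and option names here are hypothetical stand-ins, not the real bmi_tcp code:

```c
/* Hypothetical sketch of the drop-address query a BMI method answers
 * via its get_info hook.  Option value and names are illustrative. */
enum { DROP_ADDR_QUERY = 1 };

/* Before the fix: keep the addr ref unless an error (e.g. a
 * cancelled/timed-out operation) occurred. */
static int drop_query_old(int had_error)
{
    return had_error ? 1 : 0;
}

/* The proposed fix: always say "drop".  The socket is closed either
 * way, so all that's lost is reusing one small allocation -- and the
 * reference list stops growing without bound. */
static int drop_query_new(int had_error)
{
    (void)had_error;
    return 1;
}

int method_get_info(int option, int had_error, int use_fix)
{
    if (option == DROP_ADDR_QUERY)
        return use_fix ? drop_query_new(had_error)
                       : drop_query_old(had_error);
    return -1; /* unknown option */
}
```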
>
> In the changes I've been working on to get multiple address support
> in BMI, I've already replaced the linked list with a hashtable.  That
> wouldn't have made the problem go away, but the degradation wouldn't
> have been quite as bad (which may have made it harder to find,
> actually).  Maybe it's time to add some profiling info (perf stats?)
> to our basic list, queue and hash structures that would tell us how
> big they're getting.
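One cheap shape that profiling could take -- a counter plus a high-water mark per structure, dumped with the other perf stats.  This is only a sketch of the idea, not existing PVFS code:

```c
#include <stdio.h>

/* Per-structure size instrumentation: bump on insert, drop on remove,
 * and keep a high-water mark so a stats dump shows how big each
 * list/queue/hash has ever gotten. */
struct size_stats {
    const char *name;
    unsigned long cur;   /* current element count */
    unsigned long max;   /* high-water mark */
};

void stats_inc(struct size_stats *s)
{
    if (++s->cur > s->max)
        s->max = s->cur;
}

void stats_dec(struct size_stats *s)
{
    if (s->cur > 0)
        s->cur--;
}

void stats_dump(const struct size_stats *s)
{
    fprintf(stderr, "%s: current=%lu max=%lu\n", s->name, s->cur, s->max);
}
```

A steadily climbing high-water mark on the addr ref list would have flagged this bug long before the slowdown became user-visible.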
>
> Anyway, thanks to all for contributing to the debugging process for
> this one.
>
> -sam
>
> On Sep 26, 2007, at 6:00 PM, Sam Lang wrote:
>
> >
> > Hi All,
> >
> > I've been trying to debug a problem with PVFS, where performance
> > degrades slowly with a long-lived (weeks and months) PVFS volume.
> > The degradation is significant -- simple metadata operations are an
> > order of magnitude slower after a month or so.  The behavior turns
> > out to only occur with the VFS and pvfs2-client daemon:
> > performance of the admin tools (pvfs2-touch, pvfs2-rm, etc.) to the
> > same set of servers remains good.  Restarting the client daemon
> > also fixes the problem, suggesting that the long-lived open sockets
> > are somehow the cause.  The slowness also appears to be at the
> > servers not the clients: the same kernel module and client daemon
> > to a different filesystem and set of servers doesn't exhibit the
> > performance degradation.
> >
> > Also, I should mention that the system config is a little different
> > from usual.  We have IO nodes mounting and unmounting the PVFS
> > volume (and stopping the client daemon) with each user's job, which
> > is fairly frequent, while on the login nodes (where the performance
> > degrades) the volume remains mounted for a long time.
> >
> > Our hunch here is that epoll or our use of epoll on the servers is
> > somehow to blame.  Maybe the file descriptors opened on the server
> > for pvfs2-client-core are getting pushed down further and further
> > into the epoll set, which for some reason is growing with new
> > connections coming and going.  This might be the case if we were
> > failing to remove sockets from the set on disconnect, for example.
> > It doesn't look like that's happening though, at least for normal
> > disconnects.
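The disconnect path being described would look roughly like this -- a sketch with illustrative names, not the actual BMI code:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Remove a socket from the epoll set before closing it.  (Closing the
 * last fd for a file description removes it from epoll sets
 * automatically, but an explicit EPOLL_CTL_DEL keeps the bookkeeping
 * unambiguous and surfaces errors.) */
int socket_collection_remove(int epfd, int fd)
{
    /* Kernels before 2.6.9 require a non-NULL event pointer even for
     * EPOLL_CTL_DEL -- relevant on the SUSE 2.6.5 kernel in question. */
    struct epoll_event ev = { 0 };
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) < 0)
        return -1;
    return 0;
}

int handle_disconnect(int epfd, int fd)
{
    int ret = socket_collection_remove(epfd, fd);
    close(fd);
    return ret;
}
```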
> >
> > It's a PITA to debug, because the servers have to remain running for
> > a long time (and the clients have to remain mounted) for the
> > problem to be visible.  Rob suggested I use strace on the servers
> > to see what epoll was doing, and that showed some interesting
> > results.  Basically, it looks like epoll_wait takes significantly
> > longer when clients are doing operations over the VFS, rather than
> > with the pvfs2 admin tools.  Also, strace reported epoll_ctl(...,
> > EPOLL_CTL_ADD, ...) getting called a few times, even for the VFS
> > ops, and in those cases it's returning EEXIST.
> >
> > I noticed that we add a socket to the epoll set whenever we get a
> > new connection, or a read or write is posted (enqueue_operation),
> > but we only remove the socket from the epoll set on errors or
> > disconnects.  So why are we adding it for reads and writes?  Any
> > connected socket should already be in the set, no?  I think this
> > may be why I'm seeing EEXIST with strace.
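If the add really is issued on every posted read/write, one defensive pattern is to treat EEXIST as "already registered" and fall back to EPOLL_CTL_MOD.  This is a sketch with an illustrative name, not the real BMI_socket_collection_add:

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Add fd to the epoll set; if it's already there (e.g. it was added
 * when the connection was accepted), just update its event mask. */
int socket_collection_add(int epfd, int fd, unsigned int events)
{
    struct epoll_event ev = { 0 };
    ev.events = events;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        return 0;
    if (errno == EEXIST)
        return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
    return -1;
}
```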
> >
> > Also, is it safe to check the error from epoll_ctl in
> > BMI_socket_collection_[add|remove]?
> >
> > And finally, assuming PVFS is actually using epoll calls properly,
> > does anyone know of epoll bugs on a SUSE 2.6.5 kernel that would
> > cause epoll_ctl(..., EPOLL_CTL_DEL, ...) to not do what it's meant
> > to?  Googling epoll and SUSE 2.6.5 isn't turning up anything...
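One way to answer that question empirically: after a successful EPOLL_CTL_DEL, a fresh EPOLL_CTL_ADD for the same fd must succeed; it would fail with EEXIST if the DEL had silently not taken effect.  A small self-check along those lines (hypothetical helper, runnable on the suspect kernel):

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 1 if EPOLL_CTL_DEL actually removed the fd, 0 otherwise. */
int epoll_del_works(int epfd, int fd)
{
    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
        return 0;
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) < 0)
        return 0;
    /* If DEL really removed the fd, re-adding succeeds. */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0;
}
```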
> >
> > Thanks,
> > -sam
> >
>
> _______________________________________________
> Pvfs2-developers mailing list
> [email protected]
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>
