On Wed, 2010-09-15 at 21:54 -0700, Aaron Fabbri wrote:
> On Wed, Sep 15, 2010 at 9:04 AM, Andrew Stitcher <astitc...@redhat.com> wrote:
> > On Tue, 2010-09-14 at 12:39 -0700, a fabbri wrote:
> <snip>
> >>
> >> We could just use pthread_spin_lock() around your state transition
> >> case statement, instead of spinning on the primitive compare and swap.
> >>  Not sure how much more readable that would be, but it happens to be
> >> more my style.  ;-)
> >
> > It's not clear to me this would necessarily avoid the contention, but it
> > is certainly worth thinking about.
> >
> 
> Let me try to clarify.  Was the performance problem you were seeing
> before something like:
> 
> A. Before your boolCompAndExchange stuff, you used a scoped mutex which
> is a pthread_mutex under the covers
> B. pthread_mutexes put threads to sleep when there is
> contention, which ends up hurting performance because your critical
> sections are trivial (small with no sleeping or syscalls inside)

This is likely the case (although it was a little while ago so I can't
be 100% sure).

Essentially the previous code structure had a ScopedMutex where the
current loop around boolCompareAndSwap() is.
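
For anyone following along, the shape of that idiom is roughly as below
(illustrative only - the real code uses our own boolCompareAndSwap()
wrapper rather than the GCC builtin, and the state names are made up):

    // Sketch of a do {} while (compareAndSwap) state transition;
    // not the actual AsynchIO code.
    enum State { IDLE = 0, NOTIFIED = 1 };

    volatile int state = IDLE;

    void notify() {
        int seen;
        do {
            seen = state;              // snapshot the current state
            if (seen == NOTIFIED)      // already notified, nothing to do
                return;
        } while (!__sync_bool_compare_and_swap(&state, seen, NOTIFIED));
        // the whole critical section is just this state transition
    }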

> ...

> Evidence of B would have been something like oprofile output which
> showed kernel mutex activity.  (Linux userspace mutexes don't enter
> the kernel unless there is contention).

I think we had systemtap output showing that these mutexes entered the
kernel a lot, which is equivalent to what you say above.

> 
> Adaptive spin/sleep mutex implementations spin for a little while
> before sleeping to avoid this problem.  These have existed places like
> the FreeBSD kernel for a long time, but I don't think the linux
> pthread_mutex (userspace) implementation has this yet.  In this case,
> if you know your critical sections are "trivial", you can use
> spinlocks.

It is possible that there was some small redesign which reduced the work
in the "lock" region, and that this is responsible for most of the
improvement. Of course if we revert to using Mutex again, we may find
that there is now actually no performance degradation.

> 
> My impression is that you've implemented your own sort of spinlock with the do
> { } while (comp_exchange) idiom, but a pthread_mutex may be slightly
> more readable.  I'd expect both to perform similarly.  Hope that is clearer.

As I implied before, measurements will tell us!
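
For reference, your suggestion would look something like this (again
just a sketch with invented state names, not our actual code):

    #include <pthread.h>

    // Sketch of the pthread spinlock alternative: the lock is only held
    // across a trivial state transition, so the holder never sleeps.
    static pthread_spinlock_t stateLock;
    static int state = 0;

    void initLock() {
        pthread_spin_init(&stateLock, PTHREAD_PROCESS_PRIVATE);
    }

    void transition(int event) {
        pthread_spin_lock(&stateLock);
        switch (state) {               // no syscalls, no sleeping in here
        case 0: state = event; break;
        default: break;
        }
        pthread_spin_unlock(&stateLock);
    }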

> ...
> >
> > No, the IO threads are entirely independent of the number of
> > connections. The rule is something like 1 IO thread per CPU (this needs
> > to be revisited in the light of the NUMA nature of current multicore,
> > multisocket machines).
> 
> Thanks for clarification.  Can you point me to where in the code these
> threads are spawned?

On the broker they are spawned in the Broker::run() method, I think; in
the client, in the ConnectionImpl class.
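
The shape of it is roughly as follows (the function names here are made
up - the real code goes through our own Thread/Poller wrappers):

    #include <pthread.h>
    #include <unistd.h>
    #include <vector>

    // Sketch of the "one IO thread per CPU" rule; --worker-threads just
    // overrides the computed default.
    extern "C" void* ioThreadLoop(void*);   // wait for IO work, do it, repeat

    void startIOThreads(int workerThreads /* 0 = use the default */) {
        long cpus = sysconf(_SC_NPROCESSORS_ONLN);
        int n = workerThreads > 0 ? workerThreads : int(cpus);
        std::vector<pthread_t> threads(n);
        for (int i = 0; i < n; ++i)
            pthread_create(&threads[i], 0, ioThreadLoop, 0);
    }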

> 
> > The IO threads all loop in parallel doing something like:
> >
> > Loop:
> >  Wait for IO work to do
> >  Do it.
> 
> All threads wait (select or epoll) on the same set of file descriptors, right?
> 
> Doesn't this mean that all IO threads race to service the same events?
>  That is, all N threads wake up when an fd becomes readable?

My understanding is that the kernel deals with this by waking only a
single epoll_wait() caller if there is only a single event available. As
far as I know we've never observed a thundering herd effect.

> 
> In the linux/epoll case, does using EPOLLONESHOT mean that only one
> thread gets woken up, or that all wake up, but they only wake up once
> until the fd is rearmed.  (Didn't see this info in man pages).

The semantics of EPOLLONESHOT are that epoll_wait() only returns the fd
once until it is rearmed, at which point it may be returned again. How
the kernel itself handles this internally I don't know.
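
In code terms the pattern is roughly this (error handling omitted, and
this is only illustrative of EPOLLONESHOT usage, not our actual Poller):

    #include <sys/epoll.h>

    // Each IO thread waits on the same epoll fd; with EPOLLONESHOT an fd
    // is reported to one waiter and then stays silent until re-armed.
    void ioEventLoop(int epfd) {
        struct epoll_event ev;
        for (;;) {
            if (epoll_wait(epfd, &ev, 1, -1) != 1)
                continue;
            int fd = ev.data.fd;
            // ... handle the readable/writable event on fd ...
            ev.events = EPOLLIN | EPOLLONESHOT;       // re-arm
            epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
        }
    }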

> 
> > The upper layer threads can be entirely different threads (in the case
> > of a client) or in the case of the broker an arbitrary IO thread
> > (possibly one currently processing an event from this connection,
> > possibly processing another connection). The broker does nearly all its
> > work on IO threads. The exceptions are Timer events which are at least
> > initiated on their own thread, and I think some management related work.
> >
> >>
> >> Where does the --worker-threads=N arg to the CPP qpidd broker come into 
> >> play?
> >
> > This overrides the default selection of number of IO threads.
> >
> >>
> >> Finally--perhaps a can of worms-- but why does notifyPendingWrite()
> >> exist, instead of just writeThis().  Is this part of the "bottom-up
> >> IO" design?  I feel like having the app tell us it wants to write (so
> >> call me back) is more complex than just having a writeThis(buf)
> >> method.
> >
> > It currently works like this to ensure that the actual writes happen
> > correctly serialised to the connection processing, ie when the callback
> > for "please write something" happens we can be sure that nothing else is
> > happening on the connection.
> 
> Humm.  You can serialize writes either way, right?  Just put them in a
> queue (or return an error if connection is down).  Maybe I'm missing
> the point.
> 
> It seems like:
> 
> aio->notifyPendingWrite()
> callback idle()
> idle calls queueWrite()
> if queueWrite() cannot post send, it calls full() callback
> 
> Could be simplified as
> 
> aio->queueWrite()
> 
> with some changes in semantics and/or the introduction of a queue of
> outgoing-but-not-posted sends

I think one substantive advantage of the current scheme is that we don't
need to lock access to the write queue.

Another aspect is that it may not be known until the callback which of
(potentially) many connections will receive a given frame. At least this
was true when we started doing this; it may be that the allocation of
messages between connections is done differently now.

This is relevant to a queue that is subscribed to by many connections.
When there is a message ready to send from the queue, roughly speaking
you need to tell every connection that might take the message that there
is one, and then you give the message to whichever connection calls back
via idle(). Remember that not all of the connections may be idle.

Writing this now, it does seem that this approach is subject to the
thundering herd problem too. (Patches welcome!)
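
To make the trade-off concrete, the two shapes look roughly like this
(illustrative only - the real AsynchIO interface differs in detail):

    #include <deque>
    #include <pthread.h>

    struct Buffer {};

    // Current scheme: callers only ask to be called back; the write queue
    // is touched solely from the connection's IO thread, so no lock.
    struct CallbackAIO {
        std::deque<Buffer*> writeQueue;          // IO thread only
        void notifyPendingWrite() { /* wake the IO thread */ }
        void idle() {                            // serialised on the IO thread
            if (Buffer* b = nextFrameToSend())
                writeQueue.push_back(b);
        }
        Buffer* nextFrameToSend() { return 0; }  // decided at callback time
    };

    // Direct scheme: any thread can hand over a buffer, so the queue
    // needs a lock (or some other serialisation).
    struct DirectAIO {
        pthread_mutex_t lock;
        std::deque<Buffer*> writeQueue;
        DirectAIO() { pthread_mutex_init(&lock, 0); }
        void queueWrite(Buffer* b) {
            pthread_mutex_lock(&lock);
            writeQueue.push_back(b);
            pthread_mutex_unlock(&lock);
        }
    };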

Andrew


