On Wed, Sep 15, 2010 at 9:04 AM, Andrew Stitcher <astitc...@redhat.com> wrote:
> On Tue, 2010-09-14 at 12:39 -0700, a fabbri wrote:
<snip>
>>
>> We could just use pthread_spin_lock() around your state transition
>> case statement, instead of spinning on the primitive compare and swap.
>> Not sure how much more readable that would be, but it happens to be
>> more my style. ;-)
>
> It's not clear to me this would necessarily avoid the contention, but it
> is certainly worth thinking about.
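For concreteness, here's roughly what I mean by the two alternatives
above. This is just an untested sketch with made-up names (ioState,
nextState), not the actual AsynchIO code, with std::atomic standing in
for the compare-and-swap primitive:

#include <pthread.h>
#include <atomic>

int nextState(int current) { return current + 1; }  // placeholder transition

// Variant 1: a pthread spinlock around the whole state transition.
// pthread_spin_init(&stateLock, PTHREAD_PROCESS_PRIVATE) must run once first.
pthread_spinlock_t stateLock;
int ioState = 0;                       // stand-in for the real state enum

void transitionWithSpinlock() {
    pthread_spin_lock(&stateLock);     // busy-waits; never sleeps in the kernel
    ioState = nextState(ioState);      // the "trivial" critical section
    pthread_spin_unlock(&stateLock);
}

// Variant 2: the hand-rolled do { } while (compare-and-swap) idiom.
std::atomic<int> atomicState(0);

void transitionWithCAS() {
    int current = atomicState.load();
    int desired;
    do {
        desired = nextState(current);  // recompute from the observed state
        // On failure, compare_exchange_weak reloads 'current' with the
        // value it actually saw, so the next try starts from that.
    } while (!atomicState.compare_exchange_weak(current, desired));
}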
Let me try to clarify. Was the performance problem you were seeing before
something like:

A. Before your boolCompAndExchange stuff, you used a scoped mutex, which is
   a pthread_mutex under the covers.

B. pthread_mutexes put threads to sleep when there is contention, which
   ends up hurting performance because your critical sections are trivial
   (small, with no sleeping or syscalls inside).

Is that about right? Evidence of B would have been something like oprofile
output showing kernel mutex activity. (Linux userspace mutexes don't enter
the kernel unless there is contention.)

Adaptive spin/sleep mutex implementations spin for a little while before
sleeping to avoid this problem. These have existed in places like the
FreeBSD kernel for a long time, but I don't think the Linux pthread_mutex
(userspace) implementation has this yet.

In this case, if you know your critical sections are "trivial", you can
use spinlocks. My impression is that you've implemented your own sort of
spinlock with the do { } while (comp_exchange) idiom, but a pthread
spinlock may be slightly more readable. I'd expect both to perform
similarly.

Hope that is clearer.

>> <snip>
>>
>> > The entire purpose of this state machine is to arbitrate between calls
>> > to dataEvent() and notifyPendingWrite() which can happen on different
>> > threads at the same time.
>>
>> Segue to a related question I have... Can you help me understand, or
>> just point to docs/code, the threads involved here?
>>
>> The upper layers ("app") call notifyPendingWrite() from whatever
>> thread they want. dataEvent() gets called from the poller thread. Is
>> it correct that there is typically only one poller thread per
>> Blah::AsynchIO instance?
>
> No, the IO threads are entirely independent of the number of
> connections. The rule is something like 1 IO thread per CPU (this needs
> to be revisited in the light of the NUMA nature of current multicore,
> multisocket machines).

Thanks for the clarification. Can you point me to where in the code these
threads are spawned?

> The IO threads all loop in parallel doing something like:
>
> Loop:
>   Wait for IO work to do
>   Do it.

All threads wait (select or epoll) on the same set of file descriptors,
right? Doesn't this mean that all IO threads race to service the same
events? That is, do all N threads wake up when an fd becomes readable?

In the Linux/epoll case, does using EPOLLONESHOT mean that only one thread
gets woken up, or that they all wake up, but only once until the fd is
rearmed? (I didn't see this spelled out in the man pages.) A rough sketch
of the pattern I'm asking about is further down, after the quoted bits.

> The upper layer threads can be entirely different threads (in the case
> of a client) or in the case of the broker an arbitrary IO thread
> (possibly one currently processing an event from this connection,
> possibly processing another connection). The broker does nearly all its
> work on IO threads. The exceptions are Timer events which are at least
> initiated on their own thread, and I think some management related work.

>> Where does the --worker-threads=N arg to the CPP qpidd broker come
>> into play?
>
> This overrides the default selection of number of IO threads.

>> Finally--perhaps a can of worms-- but why does notifyPendingWrite()
>> exist, instead of just writeThis(). Is this part of the "bottom-up
>> IO" design? I feel like having the app tell us it wants to write (so
>> call me back) is more complex than just having a writeThis(buf) method.
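Here's that epoll sketch, just to be concrete about what I'm asking. This
is untested and not the actual Poller code -- just the shared-epoll-set
pattern with EPOLLONESHOT and an explicit re-arm:

#include <sys/epoll.h>

// Every IO thread runs this loop against the same epollFd.
void ioThreadLoop(int epollFd) {
    for (;;) {
        struct epoll_event ev;
        int n = epoll_wait(epollFd, &ev, 1, -1);   // all threads block here
        if (n <= 0)
            continue;                              // (ignoring errors for brevity)

        int fd = ev.data.fd;
        // ... service the readable/writable fd here ...

        // With EPOLLONESHOT the fd is disabled after this one event is
        // delivered, so it must be explicitly re-armed before it can fire:
        struct epoll_event rearm;
        rearm.events = EPOLLIN | EPOLLONESHOT;
        rearm.data.fd = fd;
        epoll_ctl(epollFd, EPOLL_CTL_MOD, fd, &rearm);
    }
}

// Registration is done once per fd elsewhere, with
// epoll_ctl(epollFd, EPOLL_CTL_ADD, fd, &ev) and the same event flags.

(Back to the notifyPendingWrite() question quoted just above:)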
>
> It currently works like this to ensure that the actual writes happen
> correctly serialised to the connection processing, ie when the callback
> for "please write something" happens we can be sure that nothing else is
> happening on the connection.

Humm. You can serialize writes either way, right? Just put them in a queue
(or return an error if the connection is down). Maybe I'm missing the
point. It seems like the current flow:

  aio->notifyPendingWrite()
  -> the idle() callback fires
  -> idle() calls queueWrite()
  -> if queueWrite() cannot post the send, it calls the full() callback

could be simplified to

  aio->queueWrite()

with some changes in semantics and/or the introduction of a queue of
outgoing-but-not-posted sends. (A rough sketch of what I'm imagining is in
the P.S. below.)

Thanks again,
Aaron
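P.S. For concreteness, the queueWrite()-only shape I'm imagining is
something like the following. This is purely a hypothetical sketch -- the
names (AsynchWriter, tryPostSend, writable) are made up, it's not the
existing AsynchIO API, and locking/buffer ownership are ignored:

#include <deque>

struct Buffer;                          // stand-in for the real buffer type

class AsynchWriter {
    std::deque<Buffer*> pending;        // outgoing-but-not-posted sends

    bool tryPostSend(Buffer*) {         // placeholder: real code would write()
        return false;                   // or post an async send here
    }

public:
    // App-facing call, replacing notifyPendingWrite()/idle()/full():
    void queueWrite(Buffer* buf) {
        if (pending.empty() && tryPostSend(buf))
            return;                     // fast path: went straight out
        pending.push_back(buf);         // otherwise park it for later
    }

    // Called from an IO thread when the fd becomes writable again.
    void writable() {
        while (!pending.empty() && tryPostSend(pending.front()))
            pending.pop_front();
    }
};

The pending queue would of course need the same serialisation treatment we
were discussing above, whether that's the state machine, a spinlock, or a
mutex.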