Andrew,

Thanks for the response to my post.

What would be the consequences of not registering the FD of an Rdma::AsynchIO with the Poller?

Instead, I would have Rdma::AsynchIO internally poll its own completion queue (CQ) for the first message and then invoke the same "Rdma::AsynchIO.DataEvent -> Rdma::AsynchIO.ProcessCompletions -> QueuePair.GetNextEvent" chain that the Dispatcher/Poller would otherwise call. Once that completes, Rdma::AsynchIO would go back to polling for the next first message, and so on.

Since each Rdma::AsynchIO would be handling its own polling, message flow would be serialized via the CQ.
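
To make that concrete, here is roughly the loop I have in mind for each connection (just a sketch of the idea, not working Qpid code; handleCompletion() is a placeholder for the existing DataEvent/ProcessCompletions/GetNextEvent work):

    // Sketch only: each Rdma::AsynchIO spins on its own CQ instead of waiting
    // for the Poller to signal the completion channel's fd.
    #include <infiniband/verbs.h>

    void handleCompletion(const ibv_wc& wc);   // placeholder for the existing
                                               // DataEvent/ProcessCompletions work

    void pollLoop(ibv_cq* cq) {
        ibv_wc wc;
        for (;;) {
            int n = ibv_poll_cq(cq, 1, &wc);   // non-blocking poll of the CQ
            if (n < 0)
                break;                         // error handling elided in this sketch
            if (n == 0)
                continue;                      // nothing yet, keep spinning
            // Found a completion: do the same work the DataEvent ->
            // ProcessCompletions -> GetNextEvent chain does today, then go
            // back to spinning for the next one.
            handleCompletion(wc);
        }
    }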

You mention that the Poller "...serializes all events occurring on a given registered Handle."  Out of curiosity, is this being done by the scoped locks on the DispatchHandle.stateLock Mutex (for example in DispatchHandle.processEvent)?
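
In other words, I'm guessing the serialization boils down to a pattern like this (my reading of it, not a quote of the actual DispatchHandle code):

    // My guess at the pattern (illustrative; not quoted from the real code):
    // every callback for a handle takes that handle's stateLock first, so two
    // Poller threads can never run events for the same handle concurrently.
    void DispatchHandle::processEvent(/* event type */) {
        qpid::sys::Mutex::ScopedLock lock(stateLock);   // per-handle mutex
        // ... dispatch the readable/writable/disconnect callback ...
    }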

Regards,

Greg




On Thu, 23 Apr 2009, Andrew Stitcher wrote:

On Thu, 2009-04-23 at 14:10 -0400, gregory james marsh wrote:
Hello,

I'm interested in experimenting with Qpid Rdma in hopes of further
lowering message latency.  I've noticed that the Qpid Rdma implementation
uses a (nonblocking) completion channel with the InfiniBand verbs
completion queue (cq).  I'd like to see if not using the completion
channel and polling the cq only would yield benefits.

As evidence of the performance potential I ran a test with
ibv_rc_pingpong, the raw IB verbs latency test that comes with the OFED
distribution.  This code uses the same reliable connection, send/recv
QueuePair modes that Qpid Rdma uses.  I've attached a file of my results.
I ran ibv_rc_pingpong with and without a completion channel.  In
non-completion-channel mode, message latency was reduced by 10 usec for all
sizes tested.  For message sizes <= 16K this means ~25% or greater
improvement.  For message sizes <= 512K this yields ~50% improvement.  It
would be interesting to see how this would translate into Qpid.
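
For reference, the completion-channel runs use ibv_rc_pingpong's -e/--events switch; leaving it off makes the test busy-poll the CQ, e.g.:

    ibv_rc_pingpong -e <server>    # completion-channel (events) mode
    ibv_rc_pingpong <server>       # busy-polling mode (the default)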

I've modified Rdma::AsynchIO.processCompletions and Rdma::QueuePair to
work without a completion channel.  However, I'm at a loss as to how to
modify the Poller, which epolls the file descriptor underlying the
completion channel.  Here is how I see the current chain of event
dependencies in Qpid:

Dispatcher.run
-> Poller.wait (EpollPoller)
   -> epoll_wait (I want to omit Rdma::QueuePair.cchannel's fd from the epollfd set)
     -> DispatchHandle.ProcessEvent
       -> Rdma::AsynchIO.DataEvent
         -> Rdma::AsynchIO.ProcessCompletions (I want to omit ibv_get_cq_event on Rdma::QueuePair.cchannel)
           -> Rdma::QueuePair.GetNextEvent
             -> ibv_poll_cq IB verb until no more events
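
For context, my understanding is that the last two steps amount to the usual verbs completion-channel handshake, roughly like this (simplified sketch, not the actual Rdma::AsynchIO/QueuePair code):

    // Rough shape of the handshake I want to bypass (simplified sketch).
    #include <infiniband/verbs.h>

    void drainOnNotify(ibv_comp_channel* cchannel) {
        ibv_cq* cq;
        void* ctx;
        ibv_get_cq_event(cchannel, &cq, &ctx);   // consume the event epoll saw on cchannel->fd
        ibv_ack_cq_events(cq, 1);                // acknowledge it
        ibv_req_notify_cq(cq, 0);                // re-arm notification for the next event
        ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) > 0) {    // drain completions until empty
            // ... process each work completion ...
        }
    }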

My main question is how (where?) would I best substitute ibv_poll_cq() in
place of epoll_wait() to drive event polling for Rdma::AsynchIO?

This is going to be a difficult thing to do, as the entire flow of
control and of messages in the Qpid C++ broker is driven by whether or
not the fds registered in the Poller can be read from or written to.

If you want to experiment with polling the cq, then instead of just
giving up when there are no more events to be polled, I suggest busy
polling for a little longer and seeing how that improves your overall
latencies. Obviously that does nothing for the first message.
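
By "busy polling for a little longer" I mean something along these lines (just a sketch of the idea; the names and the budget value are made up):

    // Instead of returning as soon as ibv_poll_cq() comes back empty,
    // keep spinning for a small budget before falling back to the
    // completion channel / Poller.
    #include <infiniband/verbs.h>

    void processCompletionsWithSpin(ibv_cq* cq) {
        const int SPIN_BUDGET = 1000;            // tune empirically
        int idle = 0;
        ibv_wc wc;
        while (idle < SPIN_BUDGET) {
            if (ibv_poll_cq(cq, 1, &wc) > 0) {
                idle = 0;                        // got work, reset the budget
                // ... process the completion as before ...
            } else {
                ++idle;                          // nothing yet, spin a little longer
            }
        }
        // Budget expired: fall back to waiting on the completion channel / Poller.
    }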


I also have some miscellaneous questions regarding the Poller:

I'd still need to maintain the epoll framework, as it seems there are other
objects registering (startWatch) file descriptors with epoll.  I've
noticed that regular AsynchIO and Rdma::ConnectionManager do.  What else
registers its file descriptor with the EpollPoller?

Conceptually, everything that the broker does is triggered by an event in
the Poller (after start up, that is).  The Poller is run on multiple
threads and serialises all events occurring on a given registered Handle.

There are also a small number of timer threads, but they should be going
down to just 1 thread soon, and they should interact with the rest of
the code by causing an event in the Poller code.


How many Poller objects/threads are instantiated?  I ran with gdb and
noticed that different threads were calling Poller.wait().

By default a Poller thread is created for every CPU (+1 as a heuristic
that improved things when we last tested).

You can change the number of Poller threads with the option
"--worker-threads"

Andrew



---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]

