Re: [Openais] Corosync netmalloc TODO item

Zane Bitter Wed, 09 Mar 2011 20:58:43 -0800

On 2011/03/02, at 12:50, Steven Dake wrote:

> On 03/01/2011 05:50 PM, Zane Bitter wrote:
>> Once more, to the list this time. It seems the Reply-To header is now 
>> missing again.
>> 
>> On 2011/03/01, at 12:48, Steven Dake wrote:
>> 
>>> One more note: totemsrp.c also uses free on these frames (which should
>>> have a corresponding free call down through the
>>> totemrrp/totemnet/totemiba+totemudp+totemudpu layers).
>>> 
>>> A bit more on this point as I was thinking about it.  An IBA frame is
>>> limited to 2048 bytes or 4096 bytes depending on the kernel driver.  In
>>> order to use a buffer to send packets, the buffer must be posted to the
>>> send queue (ibv_post_send).  Once a buffer has been posted, it may not
>>> be posted again until it is processed by the hardware.  ibverbs delivers
>>> an event when a posted buffer is processed by the hardware via a
>>> completion queue (see mcast_cq_send_event_fn).
>> 
>> Interesting... the man page for ibv_post_send() says that "The buffers used 
>> by a WR can only be safely reused after WR the request is fully executed and 
>> a work completion has been retrieved from the corresponding completion queue 
>> (CQ)", which is open to interpretation of the word "reuse". Obviously you 
>> can't change the data and reuse the buffer for a different frame before the 
>> original one has been sent. But can you enqueue it again with the _same_ 
>> data?
>> 
> 
> With netmalloc I hadn't thought about the rrp case.
> 
> I believe the buffer can be posted to multiple queues.  The reason it
> can't be "reused" is because what the RDMA hardware is actually doing is
> a remote dma operation on the hardware.  If you were to queue the frame
> in the hardware, then make changes before getting the transmitted event,
> the hardware may end up transmitting a partially changed buffer.
> 
> This does create special problems for the rrp case - because rrp must
> allocate one set of frames in iba which act as one global pool (vs the
> current model where there are two separate pools per ring).
> 
>> The reason I ask is that the active rrp algorithm sends the token to all 
>> non-faulty interfaces. At the moment, the iba driver is doing a memcpy() for 
>> each of these; if it still requires a separate buffer for each outgoing 
>> frame then the best we can do is reduce the number of memcpy() calls by 1 
>> (for the n=1 case that's still a 100% reduction, which is not nothing). I 
>> think it would also require a different interface to the totemnet_malloc() 
>> function.
>> 
>>> A reference count is not needed for totemiba frames because all buffers
>>> are "preallocated" (required by RDMA design), so a totemrrp_free (X)
>>> operation, which would call totemnet_free (X), which would call
>>> totemiba_free (X), would be a no-op.
>>> 
>>> One area I went wrong when I wrote the iba code originally is I
>>> separated the send and receive buffer data structures into two separate
>>> free lists with two separate data structures.  This results in needless
>>> complication and will have to be merged into one "free list" from which
>>> prepared buffers can be retrieved and posted and then put back to.  The
>>> reason is that the way memory protection domains work (a technical
>>> detail of rdma) would limit the ability of the software to work
>>> properly with the current setup and a netmallocing feature.  But before
>>> heading down this road, I'd focus instead on keeping the current
>>> totemiba behavior (of the memcpy) and get the rest of the interfaces in
>>> shape.
>>> 
>>> Regards
>>> -steve
>> 
>> I'm happy to go ahead and implement the first patch, but I'm also trying to 
>> get my head around the iba stuff because it seems like how that works could 
>> potentially affect what the interface to the totemnet_malloc() function 
>> needs to be.
>> 
>> cheers,
>> Zane.
>> _______________________________________________
>> Openais mailing list
>> [email protected]
>> https://lists.linux-foundation.org/mailman/listinfo/openais


OK, finally found time to test that first patch :)

I think I have a fairly good idea of how to tackle the iba change. One 
question: does the change to a global pool (instead of separate pools for each 
ring) mean we need to consider thread-safety when adding/removing buffers 
to/from the free list, or are all rings handled by the same thread?
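
To make the question concrete, here is a rough sketch of the kind of single 
free list I have in mind. None of these names exist in corosync (struct 
buf_pool, pool_get() and pool_put() are made up for illustration, as are the 
sizes), and it deliberately has no locking. In other words it assumes all 
rings are serviced from one thread, which is exactly the assumption I'm asking 
about:

```c
#include <stddef.h>

#define POOL_BUFFERS 4		/* hypothetical pool size */
#define FRAME_SIZE 2048		/* smaller of the two IBA frame limits quoted above */

/* One preallocated frame buffer; all names here are illustrative only. */
struct pool_buf {
	struct pool_buf *next;	/* link in the free list */
	unsigned char data[FRAME_SIZE];
};

/* A single pool shared by every ring, instead of one pool per ring. */
struct buf_pool {
	struct pool_buf bufs[POOL_BUFFERS];
	struct pool_buf *free_head;
};

/* Put every preallocated buffer on the free list. */
static void pool_init (struct buf_pool *pool)
{
	int i;

	pool->free_head = NULL;
	for (i = 0; i < POOL_BUFFERS; i++) {
		pool->bufs[i].next = pool->free_head;
		pool->free_head = &pool->bufs[i];
	}
}

/* Take a buffer off the free list; returns NULL if none are free. */
static struct pool_buf *pool_get (struct buf_pool *pool)
{
	struct pool_buf *buf = pool->free_head;

	if (buf != NULL) {
		pool->free_head = buf->next;
		buf->next = NULL;
	}
	return buf;
}

/* Return a buffer, e.g. once its send completion has been retrieved. */
static void pool_put (struct buf_pool *pool, struct pool_buf *buf)
{
	buf->next = pool->free_head;
	pool->free_head = buf;
}
```

If the rings can be handled by different threads, pool_get() and pool_put() 
would need a mutex (or similar) around the list manipulation.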

thanks,
Zane.
