On 2011/04/04, at 16:49, Steven Dake wrote:
> On 03/30/2011 07:52 PM, Zane Bitter wrote:
>>
>> OK, so it sounds like the real issue is that we still have an extra
>> memcpy going on, in message_handler_mcast() at totemsrp.c:3861. In the
>> iba case we can avoid this copy by simply keeping the receive buffer
>> around. In the udp/udpa case we still need to allocate a new buffer
>> with totemsrp_buffer_alloc() and copy the data into it, as is
>> happening now.
>>
>> Possible solution 1: a new totemsrp_buffer_copy() function, implemented
>> as buffer_alloc()+memcpy() for udp/udpa, but just snagging a reference
>> to the existing buffer and returning it for iba.
>> Possible solution 2: modify the udp/udpa drivers to allocate a buffer
>> with buffer_alloc() and receive data directly into it, then pass
>> responsibility for it to totemsrp when we deliver. This saves a copy,
>> but may be tricky to implement without a mutex when multiple threads
>> are involved; I need to think more about it.
>>
>> But in order to effect this, as you say, we need to be able to transmit
>> or receive and transmit from the same buffer in totemiba, and we need
>> to make sure that when we call buffer_release() it gets reposted to
>> the receive queue if that is where it originated. Or, better, we just
>> need to make sure that *some* buffer gets reposted to the receive
>> queue after we receive a packet, so that we are still able to receive
>> new packets even if we are holding a large number of buffers in
>> totemsrp.c.
>>
>> Is this an accurate summary of the problem?
>>
>> The totemiba_mcast_flush_send() implementation in my patch is wrong and
>> still needs to be fixed, since the buffer it is passed is allocated on
>> the stack, not by buffer_alloc(). This can continue to allocate a send
>> buffer internally and memcpy() the data into it, as was happening
>> before and as we're continuing to do with totemiba_token_send().
>>
>> Assuming this is substantially correct, I think I now understand the
>> implementation well enough to start thinking about the threading,
>> which is the _really_ tricky part ;)
>>
>
> Right on target!
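To make the trade-off concrete, here is a rough sketch of what a
totemsrp_buffer_copy() hook (solution 1) might look like. All of the
struct fields, names, and signatures below are hypothetical
illustrations, not the actual corosync/totem interfaces: the udp/udpa
variant allocates and copies as happens today, while the iba variant
just takes a reference on the registered receive buffer the datagram
already lives in.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical buffer descriptor; the real totemsrp free-list buffers
 * look different.  is_recv_buffer would let buffer_release() repost the
 * buffer to the iba receive queue instead of freeing it.
 */
struct totem_buffer {
	void *data;
	size_t len;
	int refcount;       /* iba: references on a posted recv buffer */
	int is_recv_buffer; /* repost to the recv queue on release */
};

/* udp/udpa variant: always allocate a fresh buffer and copy the
 * received datagram into it (the extra memcpy we have today). */
static struct totem_buffer *udp_buffer_copy(const void *msg, size_t len)
{
	struct totem_buffer *buf = malloc(sizeof(*buf));
	if (buf == NULL)
		return NULL;
	buf->data = malloc(len);
	if (buf->data == NULL) {
		free(buf);
		return NULL;
	}
	memcpy(buf->data, msg, len);
	buf->len = len;
	buf->refcount = 1;
	buf->is_recv_buffer = 0;
	return buf;
}

/* iba variant: no copy; just take a reference on the receive buffer
 * that the hardware already delivered the datagram into. */
static struct totem_buffer *iba_buffer_copy(struct totem_buffer *recv_buf)
{
	recv_buf->refcount++;
	return recv_buf;
}
```

The point of the indirection is that totemsrp.c could call one
transport-supplied "copy" operation at delivery time and stay oblivious
to whether a memcpy actually happened underneath.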
Cool, we are all back on the same page then :)

> I believe solution two can work by merging the send and receive buffers
> into one data structure and having a marker indicate if it is a send
> buffer or receive buffer. This would allow the buffer_release api to
> then post to the recv buffer in the case of a recv buffer release, or
> add to the send list, if it is a send buffer.

Yes, I am leaning toward solution two now also, with one caveat: aren't
we at risk of buffer starvation if we don't repost _something_ to the
receive queue as soon as we receive a packet? With a strict segregation
of send and receive buffers, a large burst of traffic could put us in a
situation with all of the receive buffers sitting in the send queue,
waiting for a token that can now never be received.

> In regards to threading, you will probably need one list and a mutex.
> tls would be interesting, but I don't think it would work after more
> consideration.
>
> All recv buffers and freeing of those receive buffers occur on one
> thread (the main program's thread of execution). Most totempg_mcast
> calls are executed from within ipc, which are all separate threads.
> This triggers an allocation from the ipc thread. If a buffer is freed,
> it is freed from the main program's thread (via messages_free, which
> is triggered via the poll event loop). Assuming you used tls to make
> separate lists, all allocs would be freed into one list that was never
> accessible from the allocating ipc threads.

That's what I thought at first too. But I have been toying with an
implementation where, if a buffer is deallocated from a different
thread than the one it was allocated on, it is simply free()d.
Otherwise, the free list is thread-local. This seems like it would be
fast in the more common case, and not substantially slower than it is
now in the general case (i.e. it effectively just degrades to
malloc()/free()).
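A minimal sketch of that idea, under made-up names (this is not the
totempg free-list code): each buffer remembers the address of the
thread-local free-list head it was allocated under. Since each thread
sees a distinct address for its thread-local variable, comparing that
stored address against &free_list at release time tells us whether we
are on the allocating thread, with no mutex and no pthread_self() call.

```c
#include <stdlib.h>

struct buf {
	struct buf *next;
	struct buf **home;   /* free-list head of the allocating thread */
	char data[2048];     /* arbitrary illustrative payload size */
};

/* Per-thread free list; each thread gets its own head at a
 * distinct address, which doubles as the thread's identity. */
static _Thread_local struct buf *free_list;

struct buf *buf_alloc(void)
{
	struct buf *b = free_list;
	if (b != NULL) {
		free_list = b->next;      /* fast path: reuse a local buffer */
	} else {
		b = malloc(sizeof(*b));   /* slow path: fresh allocation */
		if (b == NULL)
			return NULL;
		b->home = &free_list;     /* remember who allocated it */
	}
	return b;
}

void buf_release(struct buf *b)
{
	if (b->home == &free_list) {
		b->next = free_list;      /* same thread: push onto local list */
		free_list = b;
	} else {
		free(b);                  /* other thread: degrade to free() */
	}
}
```

Only same-thread releases ever reach the local list, so buf_alloc()
never hands out a buffer whose home pointer is stale; the cross-thread
case (ipc thread allocates, main thread frees) just becomes an ordinary
malloc()/free() pair.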
My one reservation about this is that it leaves the free list
implementation with threading semantics that are not obvious, so it's
more of a potential maintainability issue than a mutex would be.

cheers,
Zane.

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
