Re: [OMPI devel] Threaded progress for CPCs
One more point that Pasha and I hashed out yesterday in IM...

To avoid the problem of posting a short handshake buffer to already-existing SRQs, we will only do the extra handshake if there are PPRQs in receive_queues. The handshake will go across the smallest PPRQ and will represent all QPs in receive_queues (even the SRQs). If there are no PPRQs in the receive_queues value, we'll just skip the handshake and rely on IB's SRQ RNR retransmitting to fix any race conditions.

One point that needs clarification: whether IBCM and RDMACM *require* posting receive buffers on the new QPs. If so, this scheme will run into trouble because we do not want to post any buffers on SRQs; that gets racy and difficult to synchronize correctly (especially if multiple remote peers are simultaneously trying to connect to a single SRQ). I'll check this out today or tomorrow.

We'll have to revisit this when iWARP NICs start supporting SRQ, but if the above assumption is true (no need to post any receive buffers for IBCM and RDMACM), it will be good enough for v1.3.

On May 20, 2008, at 12:37 PM, Jeff Squyres wrote:

Ok, I think we're mostly converged on a solution. This might not get implemented immediately (got some other pending v1.3 stuff to bug fix, etc.), but it'll happen for v1.3.
- endpoint creation will mpool alloc/register a small buffer for handshake
- cpc does not need to call _post_recvs(); instead, it can just post the single small buffer on each BSRQ QP (from the small buffer on the endpoint)
- cpc will call _connected() (in the main thread, not the CPC progress thread) when all BSRQ QPs are connected
  - if _post_recvs() was previously called, do the normal "finish setting up" stuff and declare the endpoint CONNECTED
  - if _post_recvs() was not previously called, then:
    - call _post_recvs()
    - send a short CTS message on the 1st BSRQ QP
    - wait for CTS from peer
    - when the CTS from the peer has arrived *and* we have sent our CTS, declare endpoint CONNECTED

Doing it this way adds no overhead to OOB/XOOB (which don't need this extra handshake). I think the code can be factored nicely to make this not too complicated. I'll work on this once I figure out the memory corruption I'm seeing in the receive_queues patch...

Note that this addresses the wireup multi-threading issues -- not iWARP SRQ issues. We'll tackle those separately, and possibly not for the initial v1.3.0 release.

On May 20, 2008, at 6:02 AM, Gleb Natapov wrote:

On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:

5. ...?

What about moving posting of receive buffers into main thread. With SRQ it is easy: don't post anything in CPC thread. Main thread will prepost buffers automatically after first fragment received on the endpoint (in btl_openib_handle_incoming()). With PPRQ it's more complicated. What if we'll prepost dummy buffers (not from free list) during IBCM connection stage and will run another three way handshake protocol using those buffers, but from the main thread. We will need to prepost one buffer on the active side and two buffers on the passive side.
This is probably the most viable alternative -- it would be easiest if we did this for all CPC's, not just for IBCM: - for PPRQ: CPCs only post a small number of receive buffers, suitable for another handshake that will run in the upper-level openib BTL - for SRQ: CPCs don't post anything (because the SRQ already "belongs" to the upper level openib BTL) Do we have a BSRQ restriction that there *must* be at least one PPRQ? No. We don't have such restriction and I wouldn't want to add it. If so, we could always run the upper-level openib BTL really-post- the- buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e., have the CPC post a single receive on this QP -- see below), which would make things much easier. If we don't already have this restriction, would we mind adding it? We have one PPRQ in our default receive_queues value, anyway. If there is not PPRQ then we can relay on RNR/retransmit logic in case there is not enough buffer in SRQ. We do that anyway in openib BTL code. With this rationale, once the CPC says "ok, all BSRQ QP's are connected", then _endpoint.c can run a CTS handshake to post the "real" buffers, where each side does the following: - CPC calls _endpoint_connected() to tell the upper level BTL that it is fully connected (the function is invoked in the main thread) - _endpoint_connected() posts all the "real" buffers to all the BSRQ QP's on the endpoint - _endpoint_connected() then sends a CTS control message to remote peer via smallest RC PPRQ - upon receipt of CTS: - release the buffer (***) - set endpoint state of CONNECTED and let all pending messages flow... (as it happens today) So it actually doesn't even have to be a handshake -- it's just an additional CTS sent over the newly-created RC QP. Since it's RC, we don't have to do much -- just wait for the CTS to kn
Re: [OMPI devel] Threaded progress for CPCs
Ok, I think we're mostly converged on a solution. This might not get implemented immediately (got some other pending v1.3 stuff to bug fix, etc.), but it'll happen for v1.3.

- endpoint creation will mpool alloc/register a small buffer for handshake
- cpc does not need to call _post_recvs(); instead, it can just post the single small buffer on each BSRQ QP (from the small buffer on the endpoint)
- cpc will call _connected() (in the main thread, not the CPC progress thread) when all BSRQ QPs are connected
  - if _post_recvs() was previously called, do the normal "finish setting up" stuff and declare the endpoint CONNECTED
  - if _post_recvs() was not previously called, then:
    - call _post_recvs()
    - send a short CTS message on the 1st BSRQ QP
    - wait for CTS from peer
    - when the CTS from the peer has arrived *and* we have sent our CTS, declare endpoint CONNECTED

Doing it this way adds no overhead to OOB/XOOB (which don't need this extra handshake). I think the code can be factored nicely to make this not too complicated. I'll work on this once I figure out the memory corruption I'm seeing in the receive_queues patch...

Note that this addresses the wireup multi-threading issues -- not iWARP SRQ issues. We'll tackle those separately, and possibly not for the initial v1.3.0 release.

On May 20, 2008, at 6:02 AM, Gleb Natapov wrote:

On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:

5. ...?

What about moving posting of receive buffers into main thread. With SRQ it is easy: don't post anything in CPC thread. Main thread will prepost buffers automatically after first fragment received on the endpoint (in btl_openib_handle_incoming()). With PPRQ it's more complicated. What if we'll prepost dummy buffers (not from free list) during IBCM connection stage and will run another three way handshake protocol using those buffers, but from the main thread. We will need to prepost one buffer on the active side and two buffers on the passive side.
This is probably the most viable alternative -- it would be easiest if we did this for all CPC's, not just for IBCM: - for PPRQ: CPCs only post a small number of receive buffers, suitable for another handshake that will run in the upper-level openib BTL - for SRQ: CPCs don't post anything (because the SRQ already "belongs" to the upper level openib BTL) Do we have a BSRQ restriction that there *must* be at least one PPRQ? No. We don't have such restriction and I wouldn't want to add it. If so, we could always run the upper-level openib BTL really-post- the- buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e., have the CPC post a single receive on this QP -- see below), which would make things much easier. If we don't already have this restriction, would we mind adding it? We have one PPRQ in our default receive_queues value, anyway. If there is not PPRQ then we can relay on RNR/retransmit logic in case there is not enough buffer in SRQ. We do that anyway in openib BTL code. With this rationale, once the CPC says "ok, all BSRQ QP's are connected", then _endpoint.c can run a CTS handshake to post the "real" buffers, where each side does the following: - CPC calls _endpoint_connected() to tell the upper level BTL that it is fully connected (the function is invoked in the main thread) - _endpoint_connected() posts all the "real" buffers to all the BSRQ QP's on the endpoint - _endpoint_connected() then sends a CTS control message to remote peer via smallest RC PPRQ - upon receipt of CTS: - release the buffer (***) - set endpoint state of CONNECTED and let all pending messages flow... (as it happens today) So it actually doesn't even have to be a handshake -- it's just an additional CTS sent over the newly-created RC QP. Since it's RC, we don't have to do much -- just wait for the CTS to know that the remote side has actually posted all the receives that we expect it to have. 
Since the CTS flows over a PPRQ, there's no issue about receiving the CTS on an SRQ (because the SRQ may not have any buffers posted at any given time). Correct. Full handshake is not needed. The trick is to allocate those initial buffers in a smart way. IMO initial buffer should be very small (a couple of bytes only) and be preallocated on endpoint creation. This will solve locking problem. -- Gleb. ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems
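The wireup sequence discussed in this message can be sketched as a small state machine. This is only an illustration under stated assumptions: the type endpoint_t, its flags, and the functions endpoint_init(), endpoint_connected() and endpoint_cts_received() are hypothetical names, not the actual openib BTL code, and posting buffers or sending the CTS is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical endpoint states; the real openib BTL defines its own. */
typedef enum { EP_CONNECTING, EP_WAIT_CTS, EP_CONNECTED } ep_state_t;

typedef struct {
    ep_state_t state;
    bool recvs_posted;  /* did the CPC already post the "real" receives? */
    bool cts_sent;      /* have we sent our CTS? */
    bool cts_received;  /* has the peer's CTS arrived? */
} endpoint_t;

void endpoint_init(endpoint_t *ep, bool cpc_posted_recvs)
{
    ep->state = EP_CONNECTING;
    ep->recvs_posted = cpc_posted_recvs;
    ep->cts_sent = false;
    ep->cts_received = false;
}

/* CONNECTED only when our CTS is out *and* the peer's CTS is in. */
static void maybe_declare_connected(endpoint_t *ep)
{
    if (ep->cts_sent && ep->cts_received) {
        ep->state = EP_CONNECTED;
    }
}

/* Invoked in the main thread once the CPC reports all QPs connected. */
void endpoint_connected(endpoint_t *ep)
{
    assert(ep->state == EP_CONNECTING);
    if (ep->recvs_posted) {
        /* OOB/XOOB-style path: no extra handshake needed. */
        ep->state = EP_CONNECTED;
        return;
    }
    ep->recvs_posted = true;  /* stand-in for posting the real buffers */
    ep->cts_sent = true;      /* stand-in for sending the short CTS */
    ep->state = EP_WAIT_CTS;
    maybe_declare_connected(ep);
}

/* Invoked when the peer's CTS arrives; it may beat our own CTS. */
void endpoint_cts_received(endpoint_t *ep)
{
    ep->cts_received = true;
    maybe_declare_connected(ep);
}
```

The one design point worth noting is that the peer's CTS may arrive before the local side has sent its own, so the arrival is only recorded and the transition is re-checked from both entry points.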
Re: [OMPI devel] Threaded progress for CPCs
Is it possible to have sane SRQ implementation without HW flow control? It seems pretty unlikely if the only available HW flow control is to terminate the connection. ;-) Even if we can get the iWARP semantics to work, this feels kinda icky. Perhaps I'm overreacting and this isn't a problem that needs to be fixed -- after all, this situation is no different than what happens after the initial connection, but it still feels icky. What is so icky about it? Sender is faster than a receiver so flow control kicks in. My point is that we have no real flow control for SRQ. 2. The CM progress thread posts its own receive buffers when creating a QP (which is a necessary step in both CMs). However, this is problematic in two cases: [skip] I don't like 1,2 and 3. :( 4. Have a separate mpool for drawing initial receive buffers for the CM-posted RQs. We'd probably want this mpool to be always empty (or close to empty) -- it's ok to be slow to allocate / register more memory when a new connection request arrives. The memory obtained from this mpool should be able to be returned to the "main" mpool after it is consumed. This is slightly better, but still... Agreed; my reactions were pretty much the same as yours. 5. ...? What about moving posting of receive buffers into main thread. With SRQ it is easy: don't post anything in CPC thread. Main thread will prepost buffers automatically after first fragment received on the endpoint (in btl_openib_handle_incoming()). With PPRQ it's more complicated. What if we'll prepost dummy buffers (not from free list) during IBCM connection stage and will run another three way handshake protocol using those buffers, but from the main thread. We will need to prepost one buffer on the active side and two buffers on the passive side. 
This is probably the most viable alternative -- it would be easiest if we did this for all CPC's, not just for IBCM:

- for PPRQ: CPCs only post a small number of receive buffers, suitable for another handshake that will run in the upper-level openib BTL
- for SRQ: CPCs don't post anything (because the SRQ already "belongs" to the upper level openib BTL)

Currently iWARP does not have SRQ at all, and IMHO SRQ is not possible without HW flow control. So let's resolve the problem only for PPRQ?

Do we have a BSRQ restriction that there *must* be at least one PPRQ?

No, there is no such restriction.

If so, we could always run the upper-level openib BTL really-post-the-buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e., have the CPC post a single receive on this QP -- see below), which would make things much easier. If we don't already have this restriction, would we mind adding it? We have one PPRQ in our default receive_queues value, anyway.

I don't see a reason to add such a restriction, at least for IB. We may add it for iWARP only (actually, we already have it for iWARP).
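Since there is no requirement that receive_queues contain a PPRQ, any code that picks "the smallest PPRQ" has to cope with finding none. A minimal sketch of that selection follows; qp_spec_t, qp_type_t and pick_handshake_qp() are hypothetical names made up for illustration and do not reflect how the openib BTL actually stores its receive_queues specification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one entry in receive_queues. */
typedef enum { QP_PPRQ, QP_SRQ } qp_type_t;

typedef struct {
    qp_type_t type;
    size_t buf_size;  /* receive buffer size for this QP, in bytes */
} qp_spec_t;

/*
 * Return the index of the PPRQ with the smallest buffer size,
 * or -1 if the list contains no PPRQ at all. In the -1 case the
 * caller would skip the extra handshake entirely.
 */
int pick_handshake_qp(const qp_spec_t *qps, int n)
{
    int best = -1;
    assert(qps != NULL || n == 0);
    for (int i = 0; i < n; i++) {
        if (qps[i].type != QP_PPRQ) {
            continue;  /* never run the handshake over an SRQ */
        }
        if (best < 0 || qps[i].buf_size < qps[best].buf_size) {
            best = i;
        }
    }
    return best;
}
```

With a mixed list the function returns the smallest per-peer queue; with an SRQ-only list it returns -1.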
Re: [OMPI devel] Threaded progress for CPCs
On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote: > >> 5. ...? > > What about moving posting of receive buffers into main thread. With > > SRQ it is easy: don't post anything in CPC thread. Main thread will > > prepost buffers automatically after first fragment received on the > > endpoint (in btl_openib_handle_incoming()). With PPRQ it's more > > complicated. What if we'll prepost dummy buffers (not from free list) > > during IBCM connection stage and will run another three way handshake > > protocol using those buffers, but from the main thread. We will need > > to > > prepost one buffer on the active side and two buffers on the passive > > side. > > > This is probably the most viable alternative -- it would be easiest if > we did this for all CPC's, not just for IBCM: > > - for PPRQ: CPCs only post a small number of receive buffers, suitable > for another handshake that will run in the upper-level openib BTL > - for SRQ: CPCs don't post anything (because the SRQ already "belongs" > to the upper level openib BTL) > > Do we have a BSRQ restriction that there *must* be at least one PPRQ? No. We don't have such restriction and I wouldn't want to add it. > If so, we could always run the upper-level openib BTL really-post-the- > buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e., > have the CPC post a single receive on this QP -- see below), which > would make things much easier. If we don't already have this > restriction, would we mind adding it? We have one PPRQ in our default > receive_queues value, anyway. If there is not PPRQ then we can relay on RNR/retransmit logic in case there is not enough buffer in SRQ. We do that anyway in openib BTL code. 
> > With this rationale, once the CPC says "ok, all BSRQ QP's are > connected", then _endpoint.c can run a CTS handshake to post the > "real" buffers, where each side does the following: > > - CPC calls _endpoint_connected() to tell the upper level BTL that it > is fully connected (the function is invoked in the main thread) > - _endpoint_connected() posts all the "real" buffers to all the BSRQ > QP's on the endpoint > - _endpoint_connected() then sends a CTS control message to remote > peer via smallest RC PPRQ > - upon receipt of CTS: >- release the buffer (***) >- set endpoint state of CONNECTED and let all pending messages > flow... (as it happens today) > > So it actually doesn't even have to be a handshake -- it's just an > additional CTS sent over the newly-created RC QP. Since it's RC, we > don't have to do much -- just wait for the CTS to know that the remote > side has actually posted all the receives that we expect it to have. > Since the CTS flows over a PPRQ, there's no issue about receiving the > CTS on an SRQ (because the SRQ may not have any buffers posted at any > given time). Correct. Full handshake is not needed. The trick is to allocate those initial buffers in a smart way. IMO initial buffer should be very small (a couple of bytes only) and be preallocated on endpoint creation. This will solve locking problem. -- Gleb.
Re: [OMPI devel] Threaded progress for CPCs
Jeff Squyres wrote:

On May 19, 2008, at 4:44 PM, Steve Wise wrote:

1. Posting more at low watermark can lead to DoS-like behavior when you have a fast sender and a slow receiver. This is exactly the resource-exhaustion kind of behavior that a high quality MPI implementation is supposed to avoid -- we really should throttle the sender somehow.

2. Resending ad infinitum simply eats up more bandwidth and takes away network resources (e.g., switch resources) from other, legitimate traffic. Particularly if the receiver doesn't dip into the MPI layer for many hours. So yes, it *works*, but it's definitely sub-optimal.

The SRQ low water mark is simply an API method to allow applications to try and never hit the "we're totally out of recv bufs" problem. That's a tool that I think is needed for SRQ users no matter what flow control method you use to try and avoid Jeff's #1 item above.

If you had these buffers available, why didn't you post them when the QP was created / this sender was added?

Because you're trying to reduce memory requirements at the expense of under-provisioning the SRQ. If you don't want the transport to drop and retransmit, then you might want an algorithm to increase the low water mark during bursty periods.

This mechanism *might* make sense if there was a sensible approach to know when to remove the "additional" buffers posted to an SRQ due to bursty traffic. But how do you know when that is?

Thinking out loud:

- keep the SRQ up to the low water mark as a normal course of events
- increase the low water mark value as you get more and more "low water mark exceeded" events
- decrease the low water mark as these events become less frequent.

Dunno if this is worth the effort.

And if you don't like the RNR retry/TCP retrans approach, which is bad for reason #2 (and because TCP will eventually give up and reset the connection), then I think there needs to be some OMPI layer protocol to stop senders that are abusing the SRQ pool for whatever reason (too fast of a sender, sleeping beauty receiver never entering OMPI layer, whatever).

That implies a progress thread. If/when we add a progress thread, it will likely be for progressing long messages. Myricom and MVAPICH have shown that rapidly firing progress threads are problematic to performance. But even if you have that progress thread *only* wake up on the low watermark for the SRQ, you have two problems:

- there still could be many inbound messages that will overflow the SRQ and/or even more could be inbound by the time your STOP message gets to everyone (gets even worse as the MPI job scales up in total number of processes)
- in the case of a very large MPI job, sending the STOP message has obvious scalability problems (have to send it to everyone, which requires its own set of send buffers and WQEs/CQEs)

Ok, STOP messages won't scale...dumb idea.
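The "thinking out loud" idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the struct, the double-on-event and halve-when-quiet policy, the bounds, and the notion of a fixed "quiet interval" are invented here, and handing the new limit to the verbs layer is not shown.

```c
#include <assert.h>

/* Hypothetical: quiet intervals required before lowering the mark. */
#define QUIET_THRESHOLD 4u

typedef struct {
    unsigned low_water;        /* current low water mark, in buffers */
    unsigned min_mark;         /* never go below this */
    unsigned max_mark;         /* never go above this */
    unsigned quiet_intervals;  /* consecutive intervals with no event */
} srq_watermark_t;

void watermark_init(srq_watermark_t *w, unsigned min_mark, unsigned max_mark)
{
    assert(min_mark > 0 && min_mark <= max_mark);
    w->low_water = min_mark;
    w->min_mark = min_mark;
    w->max_mark = max_mark;
    w->quiet_intervals = 0;
}

/* A low-water event fired: traffic is bursty, so raise the mark. */
unsigned watermark_on_limit_event(srq_watermark_t *w)
{
    unsigned next = w->low_water * 2;
    w->quiet_intervals = 0;
    w->low_water = (next > w->max_mark) ? w->max_mark : next;
    return w->low_water;
}

/* An interval passed with no event: after enough of them, lower the mark. */
unsigned watermark_on_quiet_interval(srq_watermark_t *w)
{
    w->quiet_intervals++;
    if (w->quiet_intervals >= QUIET_THRESHOLD) {
        unsigned next = w->low_water / 2;
        w->quiet_intervals = 0;
        w->low_water = (next < w->min_mark) ? w->min_mark : next;
    }
    return w->low_water;
}
```

Raising quickly and lowering slowly is one plausible answer to the "how do you know when to remove the additional buffers" question, but it does nothing about a receiver that never enters the MPI layer.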
Re: [OMPI devel] Threaded progress for CPCs
On May 19, 2008, at 4:44 PM, Steve Wise wrote:

1. Posting more at low watermark can lead to DoS-like behavior when you have a fast sender and a slow receiver. This is exactly the resource-exhaustion kind of behavior that a high quality MPI implementation is supposed to avoid -- we really should throttle the sender somehow.

2. Resending ad infinitum simply eats up more bandwidth and takes away network resources (e.g., switch resources) from other, legitimate traffic. Particularly if the receiver doesn't dip into the MPI layer for many hours. So yes, it *works*, but it's definitely sub-optimal.

The SRQ low water mark is simply an API method to allow applications to try and never hit the "we're totally out of recv bufs" problem. That's a tool that I think is needed for SRQ users no matter what flow control method you use to try and avoid Jeff's #1 item above.

If you had these buffers available, why didn't you post them when the QP was created / this sender was added? This mechanism *might* make sense if there was a sensible approach to know when to remove the "additional" buffers posted to an SRQ due to bursty traffic. But how do you know when that is?

And if you don't like the RNR retry/TCP retrans approach, which is bad for reason #2 (and because TCP will eventually give up and reset the connection), then I think there needs to be some OMPI layer protocol to stop senders that are abusing the SRQ pool for whatever reason (too fast of a sender, sleeping beauty receiver never entering OMPI layer, whatever).

That implies a progress thread. If/when we add a progress thread, it will likely be for progressing long messages. Myricom and MVAPICH have shown that rapidly firing progress threads are problematic to performance.
But even if you have that progress thread *only* wake up on the low watermark for the SRQ, you have two problems: - there still could be many inbound messages that will overflow the SRQ and/or even more could be inbound by the time your STOP message gets to everyone (gets even worse as the MPI job scales up in total number of processes) - in the case of a very large MPI job, sending the STOP message has obvious scalability problems (have to send it to everyone, which requires its own set of send buffers and WQEs/CQEs) -- Jeff Squyres Cisco Systems
Re: [OMPI devel] Threaded progress for CPCs
Jeff Squyres wrote: On May 19, 2008, at 3:40 PM, Jon Mason wrote: iWARP needs preposted recv buffers (or it will drop the connection). So this isn't a good option. I was talking about SRQ only. You said above that iwarp does retransmit for SRQ. openib BTL relies on HW retransmit when using SRQ, so if iwarp doesn't do it reliably enough it can not be used with SRQ anyway. How iWARP adapters behave with respect to SRQ retransmit is 100% HW dependent. It was my understanding that it's at least the same as how TCP handles a dropped packet. The HW may do better than that. The HW can queue some of the receives internally or use the HW TCP stack to have it retransmit. Of course, this is a BAD thing to do. The SRQ "low- water marker" event is the best way to handle these cases. I disagree. I even think that the IB-retry-forever approach is bad. Here's why: 1. Posting more at low watermark can lead to DoS-like behavior when you have a fast sender and a slow receiver. This is exactly the resource-exhaustion kind of behavior that a high quality MPI implementation is supposed to avoid -- we really should to throttle the sender somehow. 2. Resending ad infinitum simply eats up more bandwidth and takes away network resources (e.g., switch resources) that other, legitimate traffic. Particularly if the receiver doesn't dip into the MPI layer for many hours. So yes, it *works*, but it's definitely sub-optimal. The SRQ low water mark is simply an API method to allow applications to try and never hit the "we're totally out recv bufs" problem. That's a tool that I think is needed for srq users no matter what flow control method you use to try and avoid jeff's #1 item above. 
And if you don't like the RNR retry/TCP retrans approach, which is bad for reason #2 (and because TCP will eventually give up and reset the connection), then I think there needs to be some OMPI layer protocol to stop senders that are abusing the SRQ pool for whatever reason (too fast of a sender, sleeping beauty receiver never entering OMPI layer, whatever).

my 1/2 cent...

Steve.
Re: [OMPI devel] Threaded progress for CPCs
On May 19, 2008, at 3:40 PM, Jon Mason wrote:

iWARP needs preposted recv buffers (or it will drop the connection). So this isn't a good option.

I was talking about SRQ only. You said above that iwarp does retransmit for SRQ. openib BTL relies on HW retransmit when using SRQ, so if iwarp doesn't do it reliably enough it can not be used with SRQ anyway.

How iWARP adapters behave with respect to SRQ retransmit is 100% HW dependent.

It was my understanding that it's at least the same as how TCP handles a dropped packet. The HW may do better than that.

The HW can queue some of the receives internally or use the HW TCP stack to have it retransmit. Of course, this is a BAD thing to do. The SRQ "low-water marker" event is the best way to handle these cases.

I disagree. I even think that the IB-retry-forever approach is bad. Here's why:

1. Posting more at low watermark can lead to DoS-like behavior when you have a fast sender and a slow receiver. This is exactly the resource-exhaustion kind of behavior that a high quality MPI implementation is supposed to avoid -- we really should throttle the sender somehow.

2. Resending ad infinitum simply eats up more bandwidth and takes away network resources (e.g., switch resources) from other, legitimate traffic. Particularly if the receiver doesn't dip into the MPI layer for many hours. So yes, it *works*, but it's definitely sub-optimal.

-- 
Jeff Squyres
Cisco Systems
Re: [OMPI devel] Threaded progress for CPCs
On Mon, May 19, 2008 at 10:12:19PM +0300, Gleb Natapov wrote: > On Mon, May 19, 2008 at 01:52:22PM -0500, Jon Mason wrote: > > On Mon, May 19, 2008 at 05:17:57PM +0300, Gleb Natapov wrote: > > > On Mon, May 19, 2008 at 05:08:17PM +0300, Pavel Shamis (Pasha) wrote: > > > > >> 5. ...? > > > > >> > > > > > What about moving posting of receive buffers into main thread. With > > > > > SRQ it is easy: don't post anything in CPC thread. Main thread will > > > > > prepost buffers automatically after first fragment received on the > > > > > endpoint (in btl_openib_handle_incoming()). > > > > It still doesn't guaranty that we will not see RNR (as I understand we > > > > trying to resolve this problem for iwarp?!) > > > > > > > I don't think that iwarp has SRQ at all. And if it has then it should > > > > While Chelsio does not currently have an adapter that has SRQs, there are > > some other iWARP vendors that do have them. > > > > > have HW flow control for it too. I don't see what advantage SRQ without > > > flow control can provide over PPRQ. > > > > Technically, this is not flow control, it is a retransmit. iWARP can use > > the HW TCP stack to retransmit, but it will not have the "retransmit > > forever" ability that setting rnr_retry to 7 has for IB. > For how long will it try to retransmit before dropping connection. > > > > > > > So this solution will cost 1 buffer on each srq ... sounds acceptable > > > > for me. But I don't see too much > > > > difference compared to #1, as I understand we anyway will be need the > > > > pipe for communication with main thread. > > > > so why don't use #1 ? > > > What communication? No communication at all. Just don't prepost buffers > > > to SRQ during connection establishment. Problem solved (only for SRQ of > > > cause). > > > > iWARP needs preposted recv buffers (or it will drop the connection). So > > this isn't a good option. > I was talking about SRQ only. You said above that iwarp does retransmit for > SRQ. 
> openib BTL relies on HW retransmit when using SRQ, so if iwarp doesn't do it > reliably enough it can not be used with SRQ anyway. How iWARP adapters behave with respect to SRQ retransmit is 100% HW dependent. The HW can queue some of the receives internally or use the HW TCP stack to have it retransmit. Of course, this is a BAD thing to do. The SRQ "low-water marker" event is the best way to handle these cases. Thanks, Jon > > -- > Gleb. > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Threaded progress for CPCs
On Mon, May 19, 2008 at 01:52:22PM -0500, Jon Mason wrote: > On Mon, May 19, 2008 at 05:17:57PM +0300, Gleb Natapov wrote: > > On Mon, May 19, 2008 at 05:08:17PM +0300, Pavel Shamis (Pasha) wrote: > > > >> 5. ...? > > > >> > > > > What about moving posting of receive buffers into main thread. With > > > > SRQ it is easy: don't post anything in CPC thread. Main thread will > > > > prepost buffers automatically after first fragment received on the > > > > endpoint (in btl_openib_handle_incoming()). > > > It still doesn't guaranty that we will not see RNR (as I understand we > > > trying to resolve this problem for iwarp?!) > > > > > I don't think that iwarp has SRQ at all. And if it has then it should > > While Chelsio does not currently have an adapter that has SRQs, there are > some other iWARP vendors that do have them. > > > have HW flow control for it too. I don't see what advantage SRQ without > > flow control can provide over PPRQ. > > Technically, this is not flow control, it is a retransmit. iWARP can use > the HW TCP stack to retransmit, but it will not have the "retransmit > forever" ability that setting rnr_retry to 7 has for IB. For how long will it try to retransmit before dropping connection. > > > > So this solution will cost 1 buffer on each srq ... sounds acceptable > > > for me. But I don't see too much > > > difference compared to #1, as I understand we anyway will be need the > > > pipe for communication with main thread. > > > so why don't use #1 ? > > What communication? No communication at all. Just don't prepost buffers > > to SRQ during connection establishment. Problem solved (only for SRQ of > > cause). > > iWARP needs preposted recv buffers (or it will drop the connection). So > this isn't a good option. I was talking about SRQ only. You said above that iwarp does retransmit for SRQ. openib BTL relies on HW retransmit when using SRQ, so if iwarp doesn't do it reliably enough it can not be used with SRQ anyway. -- Gleb.
Re: [OMPI devel] Threaded progress for CPCs
On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:
> On May 19, 2008, at 8:25 AM, Gleb Natapov wrote:
>
> > Is it possible to have sane SRQ implementation without HW flow control?
>
> It seems pretty unlikely if the only available HW flow control is to terminate the connection. ;-)
>
> >> Even if we can get the iWARP semantics to work, this feels kinda icky. Perhaps I'm overreacting and this isn't a problem that needs to be fixed -- after all, this situation is no different than what happens after the initial connection, but it still feels icky.
> > What is so icky about it? Sender is faster than a receiver so flow control kicks in.
>
> My point is that we have no real flow control for SRQ.
>
> >> 2. The CM progress thread posts its own receive buffers when creating a QP (which is a necessary step in both CMs). However, this is problematic in two cases:
> > [skip]
> >
> > I don't like 1,2 and 3. :(
> >
> >> 4. Have a separate mpool for drawing initial receive buffers for the CM-posted RQs. We'd probably want this mpool to be always empty (or close to empty) -- it's ok to be slow to allocate / register more memory when a new connection request arrives. The memory obtained from this mpool should be able to be returned to the "main" mpool after it is consumed.
> >
> > This is slightly better, but still...
>
> Agreed; my reactions were pretty much the same as yours.
>
> >> 5. ...?
> > What about moving posting of receive buffers into main thread. With SRQ it is easy: don't post anything in CPC thread. Main thread will prepost buffers automatically after first fragment received on the endpoint (in btl_openib_handle_incoming()). With PPRQ it's more complicated. What if we'll prepost dummy buffers (not from free list) during IBCM connection stage and will run another three way handshake protocol using those buffers, but from the main thread. We will need to prepost one buffer on the active side and two buffers on the passive side.
>
> This is probably the most viable alternative -- it would be easiest if we did this for all CPC's, not just for IBCM:
>
> - for PPRQ: CPCs only post a small number of receive buffers, suitable for another handshake that will run in the upper-level openib BTL
> - for SRQ: CPCs don't post anything (because the SRQ already "belongs" to the upper level openib BTL)
>
> Do we have a BSRQ restriction that there *must* be at least one PPRQ? If so, we could always run the upper-level openib BTL really-post-the-buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e., have the CPC post a single receive on this QP -- see below), which would make things much easier. If we don't already have this restriction, would we mind adding it? We have one PPRQ in our default receive_queues value, anyway.
>
> With this rationale, once the CPC says "ok, all BSRQ QP's are connected", then _endpoint.c can run a CTS handshake to post the "real" buffers, where each side does the following:
>
> - CPC calls _endpoint_connected() to tell the upper level BTL that it is fully connected (the function is invoked in the main thread)
> - _endpoint_connected() posts all the "real" buffers to all the BSRQ QP's on the endpoint
> - _endpoint_connected() then sends a CTS control message to remote peer via smallest RC PPRQ
> - upon receipt of CTS:
>   - release the buffer (***)
>   - set endpoint state of CONNECTED and let all pending messages flow... (as it happens today)
>
> So it actually doesn't even have to be a handshake -- it's just an additional CTS sent over the newly-created RC QP. Since it's RC, we don't have to do much -- just wait for the CTS to know that the remote side has actually posted all the receives that we expect it to have. Since the CTS flows over a PPRQ, there's no issue about receiving the CTS on an SRQ (because the SRQ may not have any buffers posted at any given time).
>
> (***) The CTS can even be a zero byte message (maybe with inline data if we need it?); we're just waiting for the *first* message to arrive on the smallest BSRQ PPQP. Here's a dumb question (because I've never tried it and am on a plane where I can't try it) -- can you post a 0 byte buffer (or NULL) for a receive? This would make returning the buffer to the CPC much easier (i.e., you won't have to) because the CPC [thread] will post the receive, but the upper level openib BTL [main thread] will actually receive it.
>
> We still have to solve what happens with iWARP on SRQ's, but that's really a different issue. I don't know if the iWARP vendors have thought about this much yet...?

I like the idea of the cpc only posting enough buffers to handle its connection setup. This sounds the most optimal for RDMACM, and there can even be HW specifi
Re: [OMPI devel] Threaded progress for CPCs
On Mon, May 19, 2008 at 05:17:57PM +0300, Gleb Natapov wrote:
> On Mon, May 19, 2008 at 05:08:17PM +0300, Pavel Shamis (Pasha) wrote:
> > >> 5. ...?
> > > What about moving posting of receive buffers into main thread. With SRQ it is easy: don't post anything in CPC thread. Main thread will prepost buffers automatically after first fragment received on the endpoint (in btl_openib_handle_incoming()).
> > It still doesn't guaranty that we will not see RNR (as I understand we trying to resolve this problem for iwarp?!)
>
> I don't think that iwarp has SRQ at all. And if it has then it should

While Chelsio does not currently have an adapter that has SRQs, there are some other iWARP vendors that do have them.

> have HW flow control for it too. I don't see what advantage SRQ without flow control can provide over PPRQ.

Technically, this is not flow control, it is a retransmit. iWARP can use the HW TCP stack to retransmit, but it will not have the "retransmit forever" ability that setting rnr_retry to 7 has for IB.

> > So this solution will cost 1 buffer on each srq ... sounds acceptable for me. But I don't see too much difference compared to #1, as I understand we anyway will be need the pipe for communication with main thread. so why don't use #1 ?
> What communication? No communication at all. Just don't prepost buffers to SRQ during connection establishment. Problem solved (only for SRQ of cause).

iWARP needs preposted recv buffers (or it will drop the connection). So this isn't a good option.

Thanks,
Jon
Re: [OMPI devel] Threaded progress for CPCs
On May 19, 2008, at 8:25 AM, Gleb Natapov wrote:

> Is it possible to have sane SRQ implementation without HW flow control?

It seems pretty unlikely if the only available HW flow control is to terminate the connection. ;-)

>> Even if we can get the iWARP semantics to work, this feels kinda icky. Perhaps I'm overreacting and this isn't a problem that needs to be fixed -- after all, this situation is no different than what happens after the initial connection, but it still feels icky.
> What is so icky about it? Sender is faster than a receiver so flow control kicks in.

My point is that we have no real flow control for SRQ.

>> 2. The CM progress thread posts its own receive buffers when creating a QP (which is a necessary step in both CMs). However, this is problematic in two cases:
> [skip]
>
> I don't like 1,2 and 3. :(
>
>> 4. Have a separate mpool for drawing initial receive buffers for the CM-posted RQs. We'd probably want this mpool to be always empty (or close to empty) -- it's ok to be slow to allocate / register more memory when a new connection request arrives. The memory obtained from this mpool should be able to be returned to the "main" mpool after it is consumed.
>
> This is slightly better, but still...

Agreed; my reactions were pretty much the same as yours.

>> 5. ...?
> What about moving posting of receive buffers into main thread. With SRQ it is easy: don't post anything in CPC thread. Main thread will prepost buffers automatically after first fragment received on the endpoint (in btl_openib_handle_incoming()). With PPRQ it's more complicated. What if we'll prepost dummy buffers (not from free list) during IBCM connection stage and will run another three way handshake protocol using those buffers, but from the main thread. We will need to prepost one buffer on the active side and two buffers on the passive side.

This is probably the most viable alternative -- it would be easiest if we did this for all CPC's, not just for IBCM:

- for PPRQ: CPCs only post a small number of receive buffers, suitable for another handshake that will run in the upper-level openib BTL
- for SRQ: CPCs don't post anything (because the SRQ already "belongs" to the upper level openib BTL)

Do we have a BSRQ restriction that there *must* be at least one PPRQ? If so, we could always run the upper-level openib BTL really-post-the-buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e., have the CPC post a single receive on this QP -- see below), which would make things much easier. If we don't already have this restriction, would we mind adding it? We have one PPRQ in our default receive_queues value, anyway.

With this rationale, once the CPC says "ok, all BSRQ QP's are connected", then _endpoint.c can run a CTS handshake to post the "real" buffers, where each side does the following:

- CPC calls _endpoint_connected() to tell the upper level BTL that it is fully connected (the function is invoked in the main thread)
- _endpoint_connected() posts all the "real" buffers to all the BSRQ QP's on the endpoint
- _endpoint_connected() then sends a CTS control message to remote peer via smallest RC PPRQ
- upon receipt of CTS:
  - release the buffer (***)
  - set endpoint state of CONNECTED and let all pending messages flow... (as it happens today)

So it actually doesn't even have to be a handshake -- it's just an additional CTS sent over the newly-created RC QP. Since it's RC, we don't have to do much -- just wait for the CTS to know that the remote side has actually posted all the receives that we expect it to have. Since the CTS flows over a PPRQ, there's no issue about receiving the CTS on an SRQ (because the SRQ may not have any buffers posted at any given time).

(***) The CTS can even be a zero byte message (maybe with inline data if we need it?); we're just waiting for the *first* message to arrive on the smallest BSRQ PPQP. Here's a dumb question (because I've never tried it and am on a plane where I can't try it) -- can you post a 0 byte buffer (or NULL) for a receive? This would make returning the buffer to the CPC much easier (i.e., you won't have to) because the CPC [thread] will post the receive, but the upper level openib BTL [main thread] will actually receive it.

We still have to solve what happens with iWARP on SRQ's, but that's really a different issue. I don't know if the iWARP vendors have thought about this much yet...?

--
Jeff Squyres
Cisco Systems
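The CTS scheme in the message above can be sketched as a small per-endpoint state machine. This is only an illustration under stated assumptions: the type and function names (endpoint_t, ep_on_all_qps_connected(), ep_on_cts_received()) are hypothetical and are not the openib BTL's, and the actual posting of receives and sending of the CTS are reduced to flags. It treats the endpoint as CONNECTED once our CTS has been sent and the peer's CTS has arrived, in either order.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical endpoint states; illustrative only, not the openib BTL's. */
typedef enum { EP_CONNECTING, EP_WAITING_CTS, EP_CONNECTED } ep_state_t;

typedef struct {
    ep_state_t state;
    bool real_recvs_posted; /* "real" buffers posted on all BSRQ QPs */
    bool cts_sent;          /* our CTS went out on the smallest PPRQ */
    bool cts_received;      /* the peer's CTS has arrived */
    int  handshake_bufs;    /* small receive buffers posted by the CPC */
} endpoint_t;

/* State right after the CPC has posted its single handshake buffer. */
void ep_init(endpoint_t *ep)
{
    ep->state = EP_CONNECTING;
    ep->real_recvs_posted = false;
    ep->cts_sent = false;
    ep->cts_received = false;
    ep->handshake_bufs = 1;
}

/* Declare CONNECTED only when both halves of the exchange are done. */
static void ep_maybe_finish(endpoint_t *ep)
{
    if (ep->cts_sent && ep->cts_received) {
        ep->state = EP_CONNECTED;
    }
}

/* Main thread: the CPC reported that all BSRQ QPs are connected. */
void ep_on_all_qps_connected(endpoint_t *ep)
{
    assert(ep->state == EP_CONNECTING);
    ep->real_recvs_posted = true; /* stands in for posting the real buffers */
    ep->cts_sent = true;          /* stands in for sending the CTS message */
    ep->state = EP_WAITING_CTS;
    ep_maybe_finish(ep);          /* the peer's CTS may already be here */
}

/* Main thread: the peer's CTS completed on the handshake buffer. */
void ep_on_cts_received(endpoint_t *ep)
{
    assert(ep->handshake_bufs > 0);
    ep->handshake_bufs--;         /* release the small handshake buffer */
    ep->cts_received = true;
    ep_maybe_finish(ep);
}
```

Because both handlers run in the main thread in this sketch, no locking is needed around the flags; the peer's CTS arriving before the local side has posted its own buffers is handled by the same two flags.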
Re: [OMPI devel] Threaded progress for CPCs
On Mon, May 19, 2008 at 07:39:13PM +0300, Pavel Shamis (Pasha) wrote:
>>>> So this solution will cost 1 buffer on each srq ... sounds acceptable for me. But I don't see too much difference compared to #1, as I understand we anyway will be need the pipe for communication with main thread. so why don't use #1 ?
>>> What communication? No communication at all. Just don't prepost buffers to SRQ during connection establishment. Problem solved (only for SRQ of cause).
> As i know Jeff use the pipe for some status update (Jeff, please correct me if I wrong).
> If we still need pipe for communication , I prefer #1.
> If we don't have the pipe , I prefer your solution

The pipe will still be there. The pipe itself is not the problem. The problem is that currently initial post_receives are done in the CPC thread. post_receives involves access to some data structures that are used in the main thread too (free lists, mpool, SRQ) so it has to be either protected or eliminated. I think that eliminating it is a better solution for now. For SRQ case it is also easy to do. PPRQ is more complicated but IMHO possible.

--
Gleb.
Re: [OMPI devel] Threaded progress for CPCs
>>> What about moving posting of receive buffers into main thread. With SRQ it is easy: don't post anything in CPC thread. Main thread will prepost buffers automatically after first fragment received on the endpoint (in btl_openib_handle_incoming()).
>> It still doesn't guaranty that we will not see RNR (as I understand we trying to resolve this problem for iwarp?!)
> I don't think that iwarp has SRQ at all. And if it has then it should have HW flow control for it too. I don't see what advantage SRQ without flow control can provide over PPRQ.

I agree that without HW flow control there is no reason for SRQ.

>> So this solution will cost 1 buffer on each srq ... sounds acceptable for me. But I don't see too much difference compared to #1, as I understand we anyway will be need the pipe for communication with main thread. so why don't use #1 ?
> What communication? No communication at all. Just don't prepost buffers to SRQ during connection establishment. Problem solved (only for SRQ of cause).

As far as I know, Jeff uses the pipe for some status updates (Jeff, please correct me if I'm wrong).
If we still need the pipe for communication, I prefer #1.
If we don't have the pipe, I prefer your solution.

Pasha
Re: [OMPI devel] Threaded progress for CPCs
On Mon, May 19, 2008 at 05:08:17PM +0300, Pavel Shamis (Pasha) wrote:
> >> 5. ...?
> > What about moving posting of receive buffers into main thread. With SRQ it is easy: don't post anything in CPC thread. Main thread will prepost buffers automatically after first fragment received on the endpoint (in btl_openib_handle_incoming()).
> It still doesn't guaranty that we will not see RNR (as I understand we trying to resolve this problem for iwarp?!)

I don't think that iWARP has SRQ at all. And if it has, then it should have HW flow control for it too. I don't see what advantage SRQ without flow control can provide over PPRQ.

> So this solution will cost 1 buffer on each srq ... sounds acceptable for me. But I don't see too much difference compared to #1, as I understand we anyway will be need the pipe for communication with main thread.
> so why don't use #1 ?

What communication? No communication at all. Just don't prepost buffers to SRQ during connection establishment. Problem solved (only for SRQ, of course).

--
Gleb.
Re: [OMPI devel] Threaded progress for CPCs
>> 1. When CM progress thread completes an incoming connection, it sends a command down a pipe to the main thread indicating that a new endpoint is ready to use. The pipe message will be noticed by opal_progress() in the main thread and will run a function to do all necessary housekeeping (sets the endpoint state to CONNECTED, etc.). But it is possible that the receiver process won't dip into the MPI layer for a long time (and therefore not call opal_progress and the housekeeping function). Therefore, it is possible that with an active sender and a slow receiver, the sender can overwhelm an SRQ. On IB, this will just generate RNRs and be ok (we configure SRQs to have infinite RNRs), but I don't understand the semantics of what will happen on iWARP (it may terminate? I sent an off-list question to Steve Wise to ask for detail -- we may have other issues with SRQ on iWARP if this is the case, but let's skip that discussion for now).
> Is it possible to have sane SRQ implementation without HW flow control? Anyway the described problem exists with SRQ right now too. If receiver doesn't enter progress for a long time sender can overwhelm an SRQ. I don't see how this can be fixed without progress thread (and I am not even sure that this is the problem that has to be fixed).

It may be partially resolved by srq_limit_event (this event is generated when the number of posted receive buffers drops below a predefined watermark). But I'm not sure that we want to move the RNR problem from the sender side to the receiver. The full solution would be progress thread + srq_limit_event.

>> Even if we can get the iWARP semantics to work, this feels kinda icky. Perhaps I'm overreacting and this isn't a problem that needs to be fixed -- after all, this situation is no different than what happens after the initial connection, but it still feels icky.
> What is so icky about it? Sender is faster than a receiver so flow control kicks in.

>> 2. The CM progress thread posts its own receive buffers when creating a QP (which is a necessary step in both CMs). However, this is problematic in two cases:
> [skip]
>
> I don't like 1,2 and 3. :(

If iWARP can handle RNR, #1 sounds OK to me, at least for 1.3.

>> 4. Have a separate mpool for drawing initial receive buffers for the CM-posted RQs. We'd probably want this mpool to be always empty (or close to empty) -- it's ok to be slow to allocate / register more memory when a new connection request arrives. The memory obtained from this mpool should be able to be returned to the "main" mpool after it is consumed.
> This is slightly better, but still...

>> 5. ...?
> What about moving posting of receive buffers into main thread. With SRQ it is easy: don't post anything in CPC thread. Main thread will prepost buffers automatically after first fragment received on the endpoint (in btl_openib_handle_incoming()).

It still doesn't guarantee that we will not see RNR (as I understand it, we are trying to resolve this problem for iWARP?!)
So this solution will cost 1 buffer on each SRQ ... sounds acceptable to me. But I don't see much difference compared to #1; as I understand it, we will need the pipe for communication with the main thread anyway. So why not use #1?

> With PPRQ it's more complicated. What if we'll prepost dummy buffers (not from free list) during IBCM connection stage and will run another three way handshake protocol using those buffers, but from the main thread. We will need to prepost one buffer on the active side and two buffers on the passive side.
>
> --
> Gleb.
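The srq_limit_event idea mentioned in this message can be illustrated with a small simulation of a low-watermark on a shared receive queue. This is a sketch, not verbs code: sim_srq_t and its functions are hypothetical names, and the queue is just a counter. In libibverbs the corresponding mechanism is, as far as I know, arming a limit with ibv_modify_srq() and the IBV_SRQ_LIMIT mask, then receiving a one-shot IBV_EVENT_SRQ_LIMIT_REACHED async event that has to be re-armed; the simulation mirrors that one-shot behavior as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated SRQ: only counts buffers; hypothetical, not a verbs object. */
typedef struct {
    int posted; /* receive buffers currently posted */
    int limit;  /* armed low watermark; 0 means disarmed */
    int events; /* limit events raised so far */
} sim_srq_t;

/* Arm the low-watermark event. */
void sim_srq_arm(sim_srq_t *srq, int limit)
{
    srq->limit = limit;
}

/* An incoming message consumes one buffer.
 * Returns false when no buffer is available (an RNR situation on IB). */
bool sim_srq_consume(sim_srq_t *srq)
{
    if (srq->posted == 0) {
        return false;
    }
    srq->posted--;
    if (srq->limit > 0 && srq->posted < srq->limit) {
        srq->events++;
        srq->limit = 0; /* one-shot: must be re-armed by the handler */
    }
    return true;
}

/* Event handler: top the queue back up and re-arm the watermark. */
void sim_srq_refill(sim_srq_t *srq, int target, int limit)
{
    assert(target >= srq->posted);
    srq->posted = target;
    sim_srq_arm(srq, limit);
}
```

The watermark only helps if something runs the handler while the application is outside the MPI layer, which is why the message pairs it with a progress thread.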
Re: [OMPI devel] Threaded progress for CPCs
On Sun, May 18, 2008 at 11:38:36AM -0400, Jeff Squyres wrote:
> ==> Remember that the goal for this work was to have a separate progress thread *without* all the heavyweight OMPI thread locks. Specifically: make it work in a build without --enable-progress-threads or --enable-mpi-threads (we did some preliminary testing with that stuff enabled and it had a big performance impact).
>
> 1. When CM progress thread completes an incoming connection, it sends a command down a pipe to the main thread indicating that a new endpoint is ready to use. The pipe message will be noticed by opal_progress() in the main thread and will run a function to do all necessary housekeeping (sets the endpoint state to CONNECTED, etc.). But it is possible that the receiver process won't dip into the MPI layer for a long time (and therefore not call opal_progress and the housekeeping function). Therefore, it is possible that with an active sender and a slow receiver, the sender can overwhelm an SRQ. On IB, this will just generate RNRs and be ok (we configure SRQs to have infinite RNRs), but I don't understand the semantics of what will happen on iWARP (it may terminate? I sent an off-list question to Steve Wise to ask for detail -- we may have other issues with SRQ on iWARP if this is the case, but let's skip that discussion for now).

Is it possible to have sane SRQ implementation without HW flow control? Anyway the described problem exists with SRQ right now too. If receiver doesn't enter progress for a long time sender can overwhelm an SRQ. I don't see how this can be fixed without progress thread (and I am not even sure that this is the problem that has to be fixed).

> Even if we can get the iWARP semantics to work, this feels kinda icky. Perhaps I'm overreacting and this isn't a problem that needs to be fixed -- after all, this situation is no different than what happens after the initial connection, but it still feels icky.

What is so icky about it? Sender is faster than a receiver so flow control kicks in.

> 2. The CM progress thread posts its own receive buffers when creating a QP (which is a necessary step in both CMs). However, this is problematic in two cases:

[skip]

I don't like 1,2 and 3. :(

> 4. Have a separate mpool for drawing initial receive buffers for the CM-posted RQs. We'd probably want this mpool to be always empty (or close to empty) -- it's ok to be slow to allocate / register more memory when a new connection request arrives. The memory obtained from this mpool should be able to be returned to the "main" mpool after it is consumed.

This is slightly better, but still...

> 5. ...?

What about moving posting of receive buffers into main thread. With SRQ it is easy: don't post anything in CPC thread. Main thread will prepost buffers automatically after first fragment received on the endpoint (in btl_openib_handle_incoming()). With PPRQ it's more complicated. What if we'll prepost dummy buffers (not from free list) during IBCM connection stage and will run another three way handshake protocol using those buffers, but from the main thread. We will need to prepost one buffer on the active side and two buffers on the passive side.

--
Gleb.
[OMPI devel] Threaded progress for CPCs
Sorry for the length of this mail. It's a complex issue. :-\

I did everything needed to enable the IB and RDMA CM's to have their own progress threads to handle incoming CM traffic (which is important because both CM's have timeouts for all their communications) and it seems to be working fine for simple examples. I posted an hg of this work (regularly kept in sync with the trunk):

http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/openib-fd-progress/

But in talking to Pasha today, we realized that there are big problems which will undoubtedly show up when running more than trivial examples.

==> Remember that the goal for this work was to have a separate progress thread *without* all the heavyweight OMPI thread locks. Specifically: make it work in a build without --enable-progress-threads or --enable-mpi-threads (we did some preliminary testing with that stuff enabled and it had a big performance impact).

1. When CM progress thread completes an incoming connection, it sends a command down a pipe to the main thread indicating that a new endpoint is ready to use. The pipe message will be noticed by opal_progress() in the main thread and will run a function to do all necessary housekeeping (sets the endpoint state to CONNECTED, etc.). But it is possible that the receiver process won't dip into the MPI layer for a long time (and therefore not call opal_progress and the housekeeping function). Therefore, it is possible that with an active sender and a slow receiver, the sender can overwhelm an SRQ. On IB, this will just generate RNRs and be ok (we configure SRQs to have infinite RNRs), but I don't understand the semantics of what will happen on iWARP (it may terminate? I sent an off-list question to Steve Wise to ask for detail -- we may have other issues with SRQ on iWARP if this is the case, but let's skip that discussion for now).

Even if we can get the iWARP semantics to work, this feels kinda icky. Perhaps I'm overreacting and this isn't a problem that needs to be fixed -- after all, this situation is no different than what happens after the initial connection, but it still feels icky.

2. The CM progress thread posts its own receive buffers when creating a QP (which is a necessary step in both CMs). However, this is problematic in two cases:

- If posting to an SRQ, the main thread may also be [re-]posting to the SRQ at the same time. Those endpoint data structures therefore need to be protected.
- All receive buffers come from the mpool, and therefore those data structures need to be protected.

Specifically: both threads may post to the SRQ simultaneously, but the CM will always be the first to post to a PPRQ. So although there's no race in the PPRQ endpoint data structures, there is a potential for race issues in the mpool data structures in both cases.

This is all a problem because we explicitly do not want to enable *all* the heavyweight threading infrastructure for OMPI. I see a few options, none of which seem attractive:

1. Somehow make it so only mpool and select other portions of OMPI can have threading/lock support (although this seems like a slippery slope -- I can foresee implications that would make it completely meaningless to only have some thread locks enabled and not others). This is probably the least attractive option.

2. Make the IB and RDMA CM requests be tolerant of timing out (and just restarting). This is actually a lot of work; for example, the IBCM CPC would then need to be tolerant of timing out anywhere in its 3-way handshake and starting over again. This could have serious implications for when a connection will be able to actually complete if a receiver rarely dips into the MPI layer (much worse than RDMA CM's 2-way handshake).

3. Have locks around the critical areas described in #1 that can be enabled without --enable--threads support (perhaps disabled at run time if we're not using a CM progress thread?).

4. Have a separate mpool for drawing initial receive buffers for the CM-posted RQs. We'd probably want this mpool to be always empty (or close to empty) -- it's ok to be slow to allocate / register more memory when a new connection request arrives. The memory obtained from this mpool should be able to be returned to the "main" mpool after it is consumed.

5. ...?

Thoughts?

--
Jeff Squyres
Cisco Systems
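The pipe notification described in item 1 of this message can be sketched with plain POSIX calls. This is a minimal illustration under stated assumptions, not the code in the hg tree: the names notify_init(), cm_notify_connected() and main_progress() are hypothetical, endpoints are reduced to small integer IDs, and main_progress() stands in for whatever opal_progress() would invoke. The point it shows is that the CM thread touches only the pipe, while all endpoint state is changed in the main thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define MAX_ENDPOINTS 8

static int notify_fds[2] = { -1, -1 };        /* [0] read end, [1] write end */
static int endpoint_connected[MAX_ENDPOINTS]; /* written by main thread only */

/* Create the pipe and make the read end non-blocking, so that the
 * main thread's progress loop never stalls on an empty pipe. */
int notify_init(void)
{
    if (pipe(notify_fds) != 0) {
        return -1;
    }
    int flags = fcntl(notify_fds[0], F_GETFL, 0);
    if (flags < 0) {
        return -1;
    }
    return fcntl(notify_fds[0], F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Would be called from the CM progress thread. It touches only the pipe;
 * a write this small is atomic on a POSIX pipe (it is below PIPE_BUF). */
int cm_notify_connected(int ep)
{
    assert(ep >= 0 && ep < MAX_ENDPOINTS);
    ssize_t n = write(notify_fds[1], &ep, sizeof(ep));
    return n == (ssize_t)sizeof(ep) ? 0 : -1;
}

/* Called from the main thread's progress loop. Drains the pipe, does the
 * housekeeping for each message, and returns how many were handled. */
int main_progress(void)
{
    int handled = 0;
    int ep;
    for (;;) {
        ssize_t n = read(notify_fds[0], &ep, sizeof(ep));
        if (n != (ssize_t)sizeof(ep)) {
            break; /* pipe is empty (EAGAIN) or an error occurred */
        }
        if (ep >= 0 && ep < MAX_ENDPOINTS) {
            endpoint_connected[ep] = 1; /* stands in for state = CONNECTED */
            handled++;
        }
    }
    return handled;
}
```

The weakness raised in item 1 is visible here: nothing marks an endpoint connected until the application calls main_progress(), so a receiver that stays out of the MPI layer leaves the notification sitting in the pipe.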