> I’ve been debugging my way through the sockets provider for the last few
> days because I am having problems using FI_MULTI_RECV buffers. The code
> seems to be doing what the man pages say it should do, but I can’t see how
> to recover from RX completion queue errors when they occur. For this
> discussion I am just referencing difficulties processing a FI_ETRUNC error,
> I haven’t yet considered other errors. And I haven’t yet tried other
> providers like TCP+RDM, that’s probably the next step, but in the meantime
> I’d like to have folks be aware of this.
>
> I know about the fi_setopt(…, FI_OPT_MIN_MULTI_RECV, …) command, but that
> doesn’t completely solve my problem. It is only useful for avoiding message
> truncation if all of your messages tend to be the same size, and you can
> predict the largest message you might ever receive, and then set the
> contents of the function’s ‘opt_val’ parameter to that size. And then in
> order to make multi-recv buffers efficient, the buffers should be many
> times that size. That’s not always possible.
>
> First mystery: I have many FI_MULTI_RECV buffers queued on the server’s
> endpoint, yet the sockets provider will truncate a message in order to fit
> it into the multi-recv buffer at the head of the list. Why deliberately
> destroy a message when there are follow-on buffers large enough to hold
> that message? The code could instead create a FI_MULTI_RECV completion
> event for the current buffer, then place the new message in the next buffer
> and avoid any truncation. Is there something I am missing here?
The intent is to support as many potential implementations as possible. You
could argue that the message should go to the next posted buffer. But what if
there isn't another posted buffer? Or what if the next posted buffer is
smaller than the message? Should we require the implementation to check all
buffers in the list looking for one that fits? Should it look for the best
fit? Should it queue the message into an unexpected list? What about the
buffers that were skipped in the search? Does the provider flush them, or keep
them for future messages? This is non-trivial.
Reporting ETRUNC is a valid option, although a provider could instead move to
the next buffer and place the message there. I think that would be supported
by the API. Anything beyond that is basically undefined.
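To make the two options concrete, here is a minimal sketch in plain C. It is
not sockets provider code and it does not use any libfabric API; struct
mr_buf, struct placement, place_truncate() and place_advance() are made-up
names for illustration only. It assumes the posted buffers sit in a simple
array in posting order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one posted multi-receive buffer. */
struct mr_buf {
    size_t size;   /* total bytes posted */
    size_t used;   /* bytes already consumed by earlier messages */
};

/* Where a message ended up and how much of it was copied. */
struct placement {
    size_t buf_index;
    size_t copied;
    int    truncated;
};

/* Option 1: always use the buffer at 'head'; truncate if it does not fit. */
static struct placement place_truncate(struct mr_buf *bufs, size_t head,
                                       size_t msg_len)
{
    struct mr_buf *b = &bufs[head];
    size_t room = b->size - b->used;
    struct placement p = { head, msg_len <= room ? msg_len : room,
                           msg_len > room };

    assert(b->used <= b->size);
    b->used += p.copied;
    return p;
}

/*
 * Option 2: if the message does not fit at 'head', retire that buffer and
 * try exactly one following buffer. No search, no best fit; if the next
 * buffer is also too small, the message is truncated there.
 */
static struct placement place_advance(struct mr_buf *bufs, size_t nbufs,
                                      size_t head, size_t msg_len)
{
    size_t room = bufs[head].size - bufs[head].used;

    if (msg_len > room && head + 1 < nbufs)
        head++;
    return place_truncate(bufs, head, msg_len);
}
```

With a 128-byte head buffer that already holds 120 bytes, a 96-byte message
keeps only 8 bytes under the first policy and lands whole in the following
buffer under the second. Everything past "try the next one" is where the open
questions above start.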
Sockets takes the unforgiving approach. We need to be careful setting a
precedent on what future hardware must do. (I'm not aware of hardware that
supports multi-receive directly, though some Portals-based NICs might.)
You might want to look at variable length messages as an alternative for
supporting messages with widely differing sizes.
> The fi_cq_err_entry struct for the error is produced by
> sock_cq_report_error(), which is called by sock_pe_report_rx_error(), which
> is called from sock_pe_process_rx_send(). Note that the fi_cq_err_entry
> structure does not contain a fi_addr_t field, so the FI_SOURCE address of
> the client who sent the truncated message is lost, there is no way to
> identify the affected client and send it a “please-send-msg-XXX-again”
> message. I don’t see any solution for this.
Hmm... this should be fixed. I'm not sure how yet.
> There is nothing within the fi_cq_err_entry to identify which message from
> the client was lost either. Here is an example of an error about to be
> posted on the socket completion queue’s ring buffer by
> sock_cq_report_error(), both in decimal and hex format. The flags field
> corresponds to FI_MULTI_RECV|FI_RECV|FI_MSG. In my case I am not using
> ‘data’ or ‘tag’, and many messages have a length of 96, so there is nothing
> unique here to identify the message.
>
>
>
> (gdb) p errbuf
> $9 = {op_context = 0x7fffe8000b40, flags = 66562, len = 96, buf = 0x0,
>   data = 1365, tag = 0, olen = 4, err = 265, prov_errno = -265,
>   err_data = 0x20000, err_data_size = 0}
>
> (gdb) p/x errbuf
> $10 = {op_context = 0x7fffe8000b40, flags = 0x10402, len = 0x60, buf = 0x0,
>   data = 0x555, tag = 0x0, olen = 0x4, err = 0x109, prov_errno = 0xfffffef7,
>   err_data = 0x20000, err_data_size = 0x0}
>
>
>
> Notice that the ‘buf’ field is zero, so you can’t even find the data that
> was copied into the buffer up to the point of the truncation. That ‘buf’
> value is coming from sock_cq_report_error() in the following piece of code,
> it is taking the ‘if’ branch:
>
>
>
>     if (entry->type == SOCK_PE_RX)
>         err_entry.buf = (void *) (uintptr_t)
>             entry->pe.rx.rx_iov[0].iov.addr;
>     else
>         err_entry.buf = (void *) (uintptr_t)
>             entry->pe.tx.tx_iov[0].src.iov.addr;
>
>
>
> I suspect that this is a bug, because in sock_pe_process_rx_send() I can
> dump the multi-recv buffer and see that bytes were in fact copied into the
> buffer up to the truncation point. If ‘buf’ was set to the correct value,
> then at least I might be able to parse the beginning of the truncated
> message to find a message ID, which, if I could then somehow also get the
> client’s fi_addr_t, would allow me to send a “please-send-msg-XXX-again”
> message to that client.
>
>
>
> Note that the FI_MULTI_RECV flag is set in the fi_cq_err_entry, but because
> ‘buf’ is zero, there is no way to identify which multi-recv buffer is full!
> In my case, I happen to use the op_context pointer as a place to record the
> address of the multi-recv buffer in which the message landed, so I know
> which buffer is full. But this is still awkward: this completion error is
> scores of completions into the future from where I am currently reading
> with fi_cq_sread(), and I haven’t even gotten to any completions which are
> in that buffer yet. If I recycle the buffer immediately based on the
> FI_MULTI_RECV flag, I will destroy the data for all the upcoming
> completions that I haven’t yet read.
Multi-receive completions should always return the op_context associated with
the posting of the multi-receive buffer. The buf pointer is an offset into
that buffer. But the intent is that the app can identify the multi-receive
buffer from op_context, not buf.
I'm not following the issue with previous/future completions. If the app is
actively using the buffer, I would look at maintaining a reference count for
when the buffer can be reposted. I'm not sure what the exact problem is or how
libfabric can do anything different. The use of the buffer is outside of its
scope.
By the time FI_MULTI_RECV is set on a completion, no additional completions
will be generated for that buffer. At least that is how it should work. If
not, this sounds like a bug in the provider. Error completions should be
reported in order with non-error completions.
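As a rough illustration of the reference count idea, here is a minimal
application-side sketch in plain C. None of this is libfabric API; struct
mrecv_buf and the mrecv_*() helpers are hypothetical names. It assumes the app
passes a pointer to its own struct mrecv_buf as the op_context when posting
the buffer, and that it tells the helper whether a completion carried the
FI_MULTI_RECV flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical app-side bookkeeping for one posted multi-receive buffer. */
struct mrecv_buf {
    void  *base;     /* start of the posted memory */
    size_t size;     /* bytes posted */
    int    refs;     /* messages in this buffer the app is still using */
    bool   retired;  /* a completion with the multi-recv flag was seen */
};

/* True once the provider is done with the buffer and so is the app. */
static bool mrecv_can_repost(const struct mrecv_buf *b)
{
    return b->retired && b->refs == 0;
}

/*
 * Call once per completion (normal or error) whose op_context is 'b'.
 * holds_data: the app will keep using the message bytes for a while.
 * last:       the completion carried the multi-recv flag.
 */
static void mrecv_on_completion(struct mrecv_buf *b, bool holds_data,
                                bool last)
{
    if (holds_data)
        b->refs++;
    if (last)
        b->retired = true;
}

/* Call when the app is finished with one message; true means repost now. */
static bool mrecv_release(struct mrecv_buf *b)
{
    assert(b->refs > 0);
    b->refs--;
    return mrecv_can_repost(b);
}
```

The buffer is reposted only when both conditions hold, so a truncation error
that carries the flag does not force the app to throw away messages it is
still processing.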
> So, how is one supposed to process this error? Assuming ‘buf’ is corrected
> to point to the truncated message, I would first have to use it to deduce
> the identity of the multi-recv buffer that contains the error (if I wasn’t
> using op_context). Then I would have to read and process completion events
> until I find an event that is in that same buffer. Then I would continue to
> process more events until I see an event that is NOT in that buffer,
> proving that I must have therefore finished processing all events that were
> in the buffer. At that point I could finally recycle the multi-recv buffer
> per the FI_MULTI_RECV flag. That assumes that there will be at least one
> completion event available for the buffer that follows the completed
> multi-recv buffer, but that might not be the case. Or it might be that the
> message that was truncated was the ONLY message to be placed in the
> completed buffer, in which case I’ll have no way of detecting when it is
> safe to recycle that multi-recv buffer. Hmmm, I guess if I keep a list of
> all the multi-recv buffers I’ve posted, and the order in which I’ve posted
> them, then I would know the address of the buffer following the completed
> buffer, and I could look for a completion event in that buffer. But still,
> this seems really complicated.
>
>
>
> If you haven’t already figured it out, I have been leading the
> conversation. I would like to suggest an alternate implementation that
> makes all these issues go away. It might be considered a slight change to
> the libfabric rules, so I suppose it needs to be more widely debated, but
> I’d like to at least start the discussion. Given that it is a change, other
> providers might also be affected. Does anyone know if other providers
> implement this differently?
>
>
>
> If you look at the call to sock_pe_report_error(), all of the fields that
> get reported in the fi_cq_err_entry come from the pe_entry for the message.
> Rather than posting an error immediately via a ring buffer to the CQ, why
> not post the pe_entry to the CQ as a normal completion event in the current
> time order, but have it be marked internally as an error? The application
> would then read completions from the CQ as usual, and it wouldn’t see any
> error until it read the completion event for the truncated message itself.
> So at that point the application’s call to fi_cq_sreadfrom() would fail
> with FI_EAVAIL. It knows that the fi_cq_err_entry it next reads using
> fi_cq_readerr() will apply to that CQ entry that just got the error, so the
> FI_SOURCE fi_addr_t that was returned by fi_cq_sreadfrom() is still
> available, and the app knows the identity of the truncated message’s
> client. And if ‘buf’ is fixed, it can parse the beginning of the truncated
> message to identify the particular message which it will ask the client to
> replay. Finally, since the FI_MULTI_RECV bit won’t show up in a completion
> event until the fi_cq_err_entry has been read, the app knows that all the
> messages in the expended buffer have already been processed, so it can
> always immediately recycle the buffer whenever it first sees the bit set,
> with a minimum of code.
>
>
>
> This is the only way that I can see to prevent the client’s FI_SOURCE
> fi_addr_t from being lost during error processing. For now I guess I will
> have to implement message timeouts in the clients, such that if a message
> is not acknowledged within some period of time, then it must have been
> truncated and discarded, and the client sends it again. This is certainly a
> less desirable solution.
I thought about this option as well. I don't like needing to carry state
between two calls into the library. That is problematic for any multi-threaded
use case. The best option is for fi_cq_err_entry to convey the source address
somehow, even if we need to extend the structure.
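Purely as a sketch of what "extend the structure" could look like from the
app's side, and not a proposal for the actual layout: the struct below copies
a few of the existing fi_cq_err_entry field names from the dump above and adds
one hypothetical src_addr field. demo_addr_t, DEMO_ADDR_UNSPEC, DEMO_ETRUNC,
struct demo_err_entry and demo_truncated_sender() are all made-up names, and
no libfabric headers are used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_addr_t;                  /* stand-in for fi_addr_t */
#define DEMO_ADDR_UNSPEC ((demo_addr_t) -1)    /* "source not known" */
#define DEMO_ETRUNC      265                   /* err value in the gdb dump */

/* Cut-down error entry with one HYPOTHETICAL extra field at the end. */
struct demo_err_entry {
    void        *op_context;
    uint64_t     flags;
    size_t       len;
    void        *buf;
    size_t       olen;
    int          err;
    demo_addr_t  src_addr;   /* hypothetical: sender of the failed message */
};

/*
 * Everything needed to ask for a resend comes from this one entry, so no
 * state has to be carried over from an earlier CQ read.
 * Returns 1 and stores the sender in *peer for a truncation with a known
 * source; returns 0 otherwise and leaves *peer untouched.
 */
static int demo_truncated_sender(const struct demo_err_entry *e,
                                 demo_addr_t *peer)
{
    assert(e != NULL && peer != NULL);
    if (e->err != DEMO_ETRUNC || e->src_addr == DEMO_ADDR_UNSPEC)
        return 0;
    *peer = e->src_addr;
    return 1;
}
```

Because the address travels with the error entry itself, two threads reading
errors from the same CQ would not need to coordinate with each other.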
- Sean