Re: mlx4 missing completions (was Re: [PATCH-v2] ib_srpt: Initial SRP Target merge for v3.2-rc1)

2011-10-27 Thread Bart Van Assche
On Wed, Oct 26, 2011 at 10:06 PM, Roland Dreier rol...@purestorage.com wrote:
 On Wed, Oct 26, 2011 at 11:05 AM, Bart Van Assche bvanass...@acm.org wrote:
  Can I conclude from your reply that the last WQE event refers to the
  SRQ only and that it does not provide any information about the send
  queue associated with the same QP?

 That is correct as I understand things.  The last WQE event was added
 to handle the fact that if you have a QP associated to an SRQ, you can't
 know how many receives have been taken from the SRQ and are in flight
 on the receive queue.  The IB spec is a bit vague on what last WQE
 actually means but based on the background, I believe it means that the
 last request on the receive queue has been processed, without saying
 anything about the send queue.

Are you sure about this? There is a paragraph in the InfiniBand
Architecture Manual that can only be correct if "last WQE" refers to
both the SRQ and the send queue, and not to the SRQ only. That manual
mentions that it is safe to reset the QP immediately after the last
WQE event has been received. A quote:

Note, for QPs that are associated with an SRQ, the Consumer should
take the QP through the Error State before invoking a Destroy QP or a
Modify QP to the Reset State. The Consumer may invoke the Destroy QP
without first performing a Modify QP to the Error State and waiting
for the Affiliated Asynchronous Last WQE Reached Event. However, if
the Consumer does not wait for the Affiliated Asynchronous Last WQE
Reached Event, then WQE and Data Segment leakage may occur. Therefore,
it is good programming practice to tear down a QP that is associated
with an SRQ by using the following process:
• Put the QP in the Error State;
• wait for the Affiliated Asynchronous Last WQE Reached Event;
• either:
  • drain the CQ by invoking the Poll CQ verb and either wait for CQ to
    be empty or the number of Poll CQ operations has exceeded CQ capacity
    size; or
  • post another WR that completes on the same CQ and wait for this WR
    to return as a WC;
• and then invoke a Destroy QP or Reset QP.
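For reference, the teardown process quoted above maps onto the kernel
verbs API roughly as follows. This is a non-runnable sketch:
ib_modify_qp(), ib_poll_cq() and ib_destroy_qp() are the actual kernel
verbs, but struct my_ch, the event-handler wiring and the completion
used to wait for the Last WQE event are this sketch's own bookkeeping,
not anything from a real driver.

```c
/* Hypothetical channel bookkeeping; illustrative only. */
struct my_ch {
	struct ib_qp *qp;
	struct ib_cq *cq;
	struct completion last_wqe_done;
};

/* Installed as qp_init_attr.event_handler (with the channel as
 * qp_context) when the QP is created. */
static void my_qp_event(struct ib_event *event, void *ctx)
{
	struct my_ch *ch = ctx;

	if (event->event == IB_EVENT_QP_LAST_WQE_REACHED)
		complete(&ch->last_wqe_done);
}

static void my_teardown_qp(struct my_ch *ch)
{
	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
	struct ib_wc wc;

	/* 1. Put the QP in the Error state. */
	ib_modify_qp(ch->qp, &attr, IB_QP_STATE);

	/* 2. Wait for the Last WQE Reached event. */
	wait_for_completion(&ch->last_wqe_done);

	/* 3. Drain the CQ. */
	while (ib_poll_cq(ch->cq, 1, &wc) > 0)
		;

	/* 4. Only now destroy (or reset) the QP. */
	ib_destroy_qp(ch->qp);
}
```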

Bart.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v2] ib_srpt: Initial SRP Target merge for v3.2-rc1

2011-10-26 Thread Bart Van Assche
On Mon, Oct 24, 2011 at 7:57 AM, Nicholas A. Bellinger
n...@linux-iscsi.org wrote:
 +static int srpt_write_pending(struct se_cmd *se_cmd)
 +{
 +       struct srpt_rdma_ch *ch;
 +       struct srpt_send_ioctx *ioctx;
 +       enum srpt_command_state new_state;
 +       enum rdma_ch_state ch_state;
 +       int ret;
 +
 +       ioctx = container_of(se_cmd, struct srpt_send_ioctx, cmd);
 +
 +       new_state = srpt_set_cmd_state(ioctx, SRPT_STATE_NEED_DATA);
 +       WARN_ON(new_state == SRPT_STATE_DONE);
 +
 +       ch = ioctx->ch;
 +       BUG_ON(!ch);
 +
 +       ch_state = srpt_get_ch_state(ch);
 +       switch (ch_state) {
 +       case CH_CONNECTING:
 +               /* This code should never be reached. */
 +               __WARN();
 +               ret = -EINVAL;
 +               goto out;
 +       case CH_LIVE:
 +               break;
 +       case CH_DISCONNECTING:
 +       case CH_DRAINING:
 +       case CH_RELEASING:
 +               pr_debug("cmd with tag %lld: channel disconnecting\n",
 +                        ioctx->tag);
 +               srpt_set_cmd_state(ioctx, SRPT_STATE_DATA_IN);
 +               ret = -EINVAL;
 +               goto out;
 +       }
 +       ret = srpt_xfer_data(ch, ioctx);
 +
 +out:
 +       return ret;
 +}

If not enough memory is available to perform a SCSI write, this
function returns -ENOMEM and the LIO core will report that error to
the initiator. At least when a filesystem is mounted on top of ib_srp
and asynchronous I/O is being performed, that will result in data
loss. Shouldn't the LIO core retry writes as long as not enough
memory is available?

A similar argument holds for reads.

Bart.


Re: [PATCH-v2] ib_srpt: Initial SRP Target merge for v3.2-rc1

2011-10-26 Thread Nicholas A. Bellinger
On Wed, 2011-10-26 at 08:09 +0200, Bart Van Assche wrote:
 On Mon, Oct 24, 2011 at 7:57 AM, Nicholas A. Bellinger
 n...@linux-iscsi.org wrote:
  +static int srpt_write_pending(struct se_cmd *se_cmd)
  +{
  +   struct srpt_rdma_ch *ch;
  +   struct srpt_send_ioctx *ioctx;
  +   enum srpt_command_state new_state;
  +   enum rdma_ch_state ch_state;
  +   int ret;
  +
  +   ioctx = container_of(se_cmd, struct srpt_send_ioctx, cmd);
  +
  +   new_state = srpt_set_cmd_state(ioctx, SRPT_STATE_NEED_DATA);
  +   WARN_ON(new_state == SRPT_STATE_DONE);
  +
  +   ch = ioctx->ch;
  +   BUG_ON(!ch);
  +
  +   ch_state = srpt_get_ch_state(ch);
  +   switch (ch_state) {
  +   case CH_CONNECTING:
  +   /* This code should never be reached. */
  +   __WARN();
  +   ret = -EINVAL;
  +   goto out;
  +   case CH_LIVE:
  +   break;
  +   case CH_DISCONNECTING:
  +   case CH_DRAINING:
  +   case CH_RELEASING:
  +   pr_debug("cmd with tag %lld: channel disconnecting\n",
  +ioctx->tag);
  +   srpt_set_cmd_state(ioctx, SRPT_STATE_DATA_IN);
  +   ret = -EINVAL;
  +   goto out;
  +   }
  +   ret = srpt_xfer_data(ch, ioctx);
  +
  +out:
  +   return ret;
  +}
 
 If not enough memory is available to perform a SCSI write, this
 function returns -ENOMEM and the LIO core will report that error to
 the initiator. At least when a filesystem is mounted on top of ib_srp
 and asynchronous I/O is being performed, that will result in data
 loss. Shouldn't the LIO core retry writes as long as not enough
 memory is available?
 
 A similar argument holds for reads.

So target_core_fabric_ops->write_pending(), ->queue_data_in() and
->queue_status() currently check for a -EAGAIN return in order to signal
that QUEUE_FULL logic should kick in for the associated se_cmd
descriptor.

I think it's reasonable to also allow fabric modules to signal
QUEUE_FULL using -ENOMEM as well, so I'll go ahead and change this for
v3.2 w/ ib_srpt and will send out a patch shortly.

Also, I did notice one extra case in srpt_map_sg_to_ib_sge() ->
srpt_xfer_data() that returns -EBUSY, so I'll change this to -EAGAIN to
signal QUEUE_FULL as well.

Thanks for the feedback Bart!

--nab







Re: mlx4 missing completions (was Re: [PATCH-v2] ib_srpt: Initial SRP Target merge for v3.2-rc1)

2011-10-26 Thread Bart Van Assche
On Wed, Oct 26, 2011 at 12:04 AM, Roland Dreier rol...@purestorage.com wrote:
 Sorry, but now I'm confused about what the bug is.  You have a QP
 associated with an SRQ, and you transition the QP to error.  At some
 point you get a last WQE received event for that QP (which means
 all receive requests for that QP have completed), and you drain the
 CQ after that, which means you are guaranteed to have processed
 all the receive requests for that QP.  (At least this is how the verbs
 are supposed to work).

 I don't think there is any way to guarantee that every request posted
 to the SRQ is taken by a QP.  You just have to make sure that you
 tear down every QP as above, and then you know when there are
 no more QPs associated to the SRQ that any receive requests that
 haven't completed are still on the SRQ, and you can free their
 resources after you destroy the SRQ.

I've learned something since I posted the message at the start of this
thread: all error completions are posted on the CQ, but draining the CQ
immediately after having received the last WQE event is not
sufficient. The CQ polling loop has to remain active a little longer
in order to receive all receive and send error completions. Issuing
"rmmod ib_srpt" during I/O now works without delay for SVN trunk
r3900.
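One way to express that longer-lived polling loop (a kernel-style
sketch, not runnable as-is; handle_wc() and the 100 ms grace period are
this sketch's own choices, not anything taken from ib_srpt):

```c
/* Keep draining the CQ for a grace period after the Last WQE event,
 * instead of doing a single drain pass, so that late send- and
 * receive-queue flush errors are still picked up.  Each completion
 * seen resets the grace period. */
unsigned long deadline = jiffies + msecs_to_jiffies(100);
struct ib_wc wc;

while (time_before(jiffies, deadline)) {
	while (ib_poll_cq(cq, 1, &wc) > 0) {
		handle_wc(&wc);	/* hypothetical per-completion handler */
		deadline = jiffies + msecs_to_jiffies(100);
	}
	cond_resched();
}
```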

I've had another look at the LIO version of ib_srpt and the way it
handles completions should ensure that all error completions get
processed (if any such completions are generated of course).

Can I conclude from your reply that the last WQE event refers to the
SRQ only and that it does not provide any information about the send
queue associated with the same QP?

Bart.


Re: [PATCH-v2] ib_srpt: Initial SRP Target merge for v3.2-rc1

2011-10-25 Thread Bart Van Assche
On Mon, Oct 24, 2011 at 8:33 PM, Nicholas A. Bellinger
n...@linux-iscsi.org wrote:
 It would have been nice if you had bothered to mention these in any of
 the pre-merge-window reviews, but given your delayed response on these
 items they will have to be something we'll fix post merge.

Sorry about that - it took me some time to recall all of this.

If you are looking for a test case: pulling an IB cable during I/O a
few times should allow you to trigger this.

Bart.


Re: [PATCH-v2] ib_srpt: Initial SRP Target merge for v3.2-rc1

2011-10-25 Thread Bart Van Assche
On Mon, Oct 24, 2011 at 11:01 PM, Roland Dreier rol...@purestorage.com wrote:
 On Mon, Oct 24, 2011 at 11:33 AM, Nicholas A. Bellinger
 n...@linux-iscsi.org wrote:
 - Handle IB completion timeouts. Although the InfiniBand Architecture
 Manual specifies that an HCA must generate an error completion for each
 pending work item when a queue pair reset is issued, an important
 class of HCAs doesn't do this (that includes the HCA in your test
 setup). In the SCST version of this driver such timeouts are handled
 by the function srpt_pending_cmd_timeout().

 Hold on... not sure what you mean by a queue pair reset but the IB
 spec does not say that a flush error should be generated when a QP is
 transitioned to the reset state.  The flush error completions are
 required when a QP is transitioned to the error state, and in fact
 completion entries may be lost when a QP transitions to reset.

I was referring to transitioning a QP to the error state -- see also
srpt_ch_qp_err().

 As far as I know every HCA supported by Linux does implement this
 correctly.  Which class did you have in mind as not doing that?

At least QDR ConnectX 2 HCAs with fairly recent firmware. This
behavior can be reproduced easily with the SCST version of ib_srpt as
follows:
- Log in from an SRP initiator to ib_srpt.
- Start a direct I/O read test, e.g. with fio.
- Issue the command "rmmod ib_srpt" on the target during I/O.

Bart.


mlx4 missing completions (was Re: [PATCH-v2] ib_srpt: Initial SRP Target merge for v3.2-rc1)

2011-10-25 Thread Roland Dreier
On Mon, Oct 24, 2011 at 11:07 PM, Bart Van Assche bvanass...@acm.org wrote:
 As far as I know every HCA supported by Linux does implement this
 correctly.  Which class did you have in mind as not doing that?

 At least QDR ConnectX 2 HCAs with fairly recent firmware. This
 behavior can be reproduced easily with the SCST version of ib_srpt as
 follows:
 - Log in from an SRP initiator to ib_srpt.
 - Start a direct I/O read test, e.g. with fio.
 - Issue the command "rmmod ib_srpt" on the target during I/O.

OK, this is a pretty serious bug in mlx4_ib if true.  Are
you sure that you really are seeing some pending work
requests not generating a flush error when the QP transitions
to the error state?

Keep in mind that transitioning the QP to the reset state
can cause completion entries to disappear after they
are generated.

 - R.


Re: mlx4 missing completions (was Re: [PATCH-v2] ib_srpt: Initial SRP Target merge for v3.2-rc1)

2011-10-25 Thread Bart Van Assche
On Tue, Oct 25, 2011 at 1:17 PM, Roland Dreier rol...@purestorage.com wrote:
 On Mon, Oct 24, 2011 at 11:07 PM, Bart Van Assche bvanass...@acm.org wrote:
  As far as I know every HCA supported by Linux does implement this
  correctly.  Which class did you have in mind as not doing that?
 
  At least QDR ConnectX 2 HCAs with fairly recent firmware. This
  behavior can be reproduced easily with the SCST version of ib_srpt as
  follows:
  1. Log in from an SRP initiator to ib_srpt.
  2. Start a direct I/O read test, e.g. with fio.
  3. Issue the command "rmmod ib_srpt" on the target during I/O.

 OK, this is a pretty serious bug in mlx4_ib if true.  Are
 you sure that you really are seeing some pending work
 requests not generating a flush error when the QP transitions
 to the error state?

It's a little more complex than that. The original version of ib_srpt
stops polling for completions as soon as the last WQE event has been
received and the CQ has subsequently been drained. So I don't know
whether these flush errors were not delivered or whether they were
delivered too late.

Bart.


Re: mlx4 missing completions (was Re: [PATCH-v2] ib_srpt: Initial SRP Target merge for v3.2-rc1)

2011-10-25 Thread Roland Dreier
On Tue, Oct 25, 2011 at 11:09 AM, Bart Van Assche bvanass...@acm.org wrote:
 It's a little more complex than that. The original version of ib_srpt
 stops polling for completions as soon as the last WQE event has been
 received and the CQ has subsequently been drained. So I don't know
 whether these flush errors were not delivered or whether they were
 delivered too late.

Sorry, but now I'm confused about what the bug is.  You have a QP
associated with an SRQ, and you transition the QP to error.  At some
point you get a last WQE received event for that QP (which means
all receive requests for that QP have completed), and you drain the
CQ after that, which means you are guaranteed to have processed
all the receive requests for that QP.  (At least this is how the verbs
are supposed to work).

So what goes wrong with ConnectX?

I don't think there is any way to guarantee that every request posted
to the SRQ is taken by a QP.  You just have to make sure that you
tear down every QP as above, and then you know when there are
no more QPs associated to the SRQ that any receive requests that
haven't completed are still on the SRQ, and you can free their
resources after you destroy the SRQ.

 - R.


Re: [PATCH-v2] ib_srpt: Initial SRP Target merge for v3.2-rc1

2011-10-24 Thread Roland Dreier
On Mon, Oct 24, 2011 at 11:33 AM, Nicholas A. Bellinger
n...@linux-iscsi.org wrote:
 - Handle IB completion timeouts. Although the InfiniBand Architecture
 Manual specifies that an HCA must generate an error completion for each
 pending work item when a queue pair reset is issued, an important
 class of HCAs doesn't do this (that includes the HCA in your test
 setup). In the SCST version of this driver such timeouts are handled
 by the function srpt_pending_cmd_timeout().

Hold on... not sure what you mean by a queue pair reset but the IB
spec does not say that a flush error should be generated when a QP is
transitioned to the reset state.  The flush error completions are
required when a QP is transitioned to the error state, and in fact
completion entries may be lost when a QP transitions to reset.

As far as I know every HCA supported by Linux does implement this
correctly.  Which class did you have in mind as not doing that?

 - R.