Dev, Vasu wrote:
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of [EMAIL PROTECTED]
Sent: Thursday, August 14, 2008 8:40 PM
To: [email protected]
Subject: [Open-FCoE] some fc exch reset fixes and a fix to my abort patches

These patches were made over my last abort patches and over the
current tree.

They fix a couple bugs in fc_exch.c and fc_fcp.c. Most are old bugs,
but one was caused by my abort changes. The fix for the bug I added
is also needed for the old bugs I am fixing, so I am including it in here
so I do not fill up the list by resending the old patchset. If you want
me to resend the old patches with this fix merged in it let me know
and I will do that instead.



This is fine and no need to resend patches in one merged set, I was able
to apply these patches after resolving very minor conflict due to recent
renaming of fc_fcp_recv().

I know I'm not responding to correct email/s thread for individual more
points below in this email but given so many of your floating patches,
I guess aggregate reply to all patches will expedite your patches
for inclusion to tree, therefore I'm writing all my points in this single 
response.

I took your recent 10 patches series and followed by 4 more patches up to
patch title as  "libfc: check ESB_ST_COMPLETE before completing ep (take 2)"
and then applied these 14 patches to commit.

57929ea0ac20b23f894a557f2b0842dc1ec7d804 Robert Love 5 days ago title:fcoe: 
Move protocol definitions fc_fcoe.h

These 14 patches works fine for basic testing :-) however not with
kicking error handling paths, more details below.

I tried a new error test to kick error handling paths with and without
your these 14 patches by dropping SCSI response to a SCSI read command.

With your 14 patches. My system locks up persistently after
single error test run, I mean system becomes non response without any
additional error messages though my kernel has spin lock debugging enabled.

So I tried this error test without your 14 patches, this time I ran
into exch cleanup issue which is likely due to messed up exch ref
in RRQ which you have fixed in patch 7/10. However this time multiple
error test runs works fine without causing any system lock up.

However wireshark trace looks good with or without your patches through REC,
REC RJT, ABTS, BA_ACC, RRQ and RRQ_ACC with my target when error test is
done.

So any idea why system locks up with your 14 patches?


Weird with the 14 patches it works almost perfect for me, but now it does not work for you guys :)

No idea why it would work without my patches, but there is a possible lock up when the fc_eh_host_reset functions runs. I am getting bh state errors when that is run and a lock up sometimes. Did you see the initiator go through the reset process? I see this with or without my patches though.

I have been running it hard and it has survived, but like I said in some mail I have not been able to run the abts response code (I have been trying to simulate it in this and other testing but I am thinking that is not doing the right thing). abts's ba_acc code path is untested in a real setup, because the software targets just seem to drop abts.

Do we get past the abort code and get to the fc_eh_host_reset code? Could you comment out the setting of the fc_eh_abort and fc_eh_device_reset in the scsi_host_template so we can narrow it down.


I think some more issues from these 14 patches or I failed to
understand your all changes in these 14 patches. So please help
me understand mainly changes on fc_exch.c and fcp abort handling.

1. Earlier fc_exch_abts_resp() would have completed exch for all upper
layers, that means fc_exch_timeout() would have send RRQ for sure under
ESB_ST_COMPLETE bit check.

        if (e_stat & ESB_ST_COMPLETE) {
              ep->esb_stat = e_stat & ~(ESB_ST_REC_QUAL | ESB_ST_COMPLETE);
              spin_unlock_bh(&ep->ex_lock);
              if (e_stat & ESB_ST_REC_QUAL)
                        fc_exch_rrq(ep);
          }

Now for fcp, exchange may not be completed and therefore check on

Do you mean fc_fcp may not call exch_done or that the fc_fcp.c exch_done call can race with the recov qual timer?

For the exch_done getting called I think it should alawys get called if we get a ba_accc. See below.

If you are saying that with my patch because fc_exch_abts_resp sets the ESB_ST_REC_QUAL bit and then sets the timer all under ex_lock, then drops the the lock to call the resp function which for fc_fcp.c would call exch_done, there is a race between the timer when we drop the lock and when fcp calls exch_done I agree on that.

For the abort resp case should I make it special, and for that fc_fcp.c should not call exch_done on the ep/sp, or should I just move the setting the of the timer to bottom of fc_exch_abts_resp and modify the other code to handle that. I think the first is a lot easier. The second one though allows us to have the same behavior for fc_fcp.c in all cases which is nice.


ESB_ST_COMPLETE bit check will prevent sending RRQ in that case and
exchange will not get freed since fc_exch_done() will abandon exch
free due to ESB_ST_REC_QUAL  bit set in exch_done() code path.

For the first part of the question you are referring to this in the patches:

        if (fh->fh_type != FC_TYPE_FCP &&
            ntoh24(fh->fh_f_ctl) & FC_FC_LAST_SEQ)
                rc = fc_exch_done_locked(ep);

vs this in the old code:
        if (ntoh24(fh->fh_f_ctl) & FC_FC_LAST_SEQ)
                fc_exch_complete_locked(ep);


I tried to follow the behavior in fc_exch_recv_seq_resp, where for non fcp commands fc_exch.c would do the complete/exch_done call, and for fcp commands we allow fc_fcp.c to call complete/exch_done.

So for something like the els rec fc_exch.c calls exch_done for the caller in the patch, because fc_fcp.c does not. But for a scsi commands we would not execute that code, and fc_fcp's resp handler calls the exch_done if we got a ba_acc (fc_fcp_recv->fc_fcp_abts_resp). If we got a reject then fc_fcp_abts_resp lets the command timeout and the lu reset code handle the cleanup, and if that fails then the mgr reset would clean it up. Thinking about that latter behavior again, I think that might be dumb and there is no need to let the abort or command timeout to escalate, and we can just cleanup the command in fc_fcp_abts_resp no matter what we got.

So if we got a ba_acc then fc_fcp.c should call exch_done, which sets the ESB_ST_COMPLETE bit. If we got a FC_RCTL_BA_RJT then we would not have a recovery qualifier so that path should be ok refcount wise and with the scsi-ml escalation we would eventually cleanup the fsp and ep.

Does that work or did I miss something?


I also don't understand resetting ESB_ST_COMPLETE bit in fc_exch_timeout().


In the bottom part of that patch
http://kernel.org/pub/linux/kernel/people/mnc/fcoe/16-08-08/0014-libfc-check-ESB_ST_COMPLETE-before-completing-ep-t.patch
I added the check for if two threads got holds on the ep, and both tried to call exch_done then only one would succeed. I do this by checking for ESB_ST_COMPLETE.

So the initial exch_done in fc_exch_abts_resp would hit the ESB_ST_REC_QUAL check and would not drop the hold from the alloc. But when fc_exch_rrq_resp runs when we call exch_done we want the drop to be done so I cleared it. Or if mgr reset runs we wanted the exch_done do be done.




2. Now abort handling is broken for non fc_fcp.c if no response to abort
after removed ESB_ST_ABNORMAL state handling in fc_exch_timeout. Earlier
abort was contained within fc_exch.c for all upper layers of fc_exch.c and
aborted exchange was done either on BA_ACC or ESB_ST_ABNORMAL handling in
fc_exch_timeout.


For ESB_ST_ABNORMAL we just did this:

        } else if (e_stat & ESB_ST_ABNORMAL) {
                ep->esb_stat |= ESB_ST_COMPLETE;
                spin_unlock_bh(&ep->ex_lock);

The ep is never cleaned up for non fcp commands here.


The problem with just setting the ESB_ST_COMPLETE bit and the other patches in the patchset is that when we try to reset the port we must notify the upper layers at that time that the command is getting killed., but the old exch reset code check if ESB_ST_COMPLETE was set and if it was it would not call the resp function.



In the patch
http://kernel.org/pub/linux/kernel/people/mnc/fcoe/16-08-08/0013-libfc-don-t-set-ESB_ST_COMPLETE-in-fc_exch_timeout.patch
I asked if we should be escalating the eh here. Should we be retrying the abort here or escalating and doing a reset or something? If we are doing something today, I did not see it.

For fcp commands, instead of waiting for a lun reset or port reset to come arount we could fail the ep with some new FC_EX value to indicate that it timedout at the exch layer.




3. The fc_fcp_recv() is getting called for abort issued from initiator but
not called for incoming abort though this is unlikely in initiator but yet
possible. I don't understand spec well on this case but I guess should
be same as sending abort and then drop all frames while abort finishes. You 
recently checked spec on this, so did you left not calling fcp in this
case on purpose as per spec or it is in your TODO ?

I left it on purpose because I did not understand the spec and wanted to check with you.

I did not see where in the spec it said we could get an abts for fcp commands in fcp3/4.

For fc-fs, I could not figure out where for class 3 we could get get an abts from the target for fcp command too. I saw some stuff llike in fc-fs 20.5.6.3 and 16.4.10.1 where for stop sequence it could set the abort sequence bits on acks to data, but I thought we would not be gettng/sending acks in class 3?

If we can get it in other sequences I thought we wanted to check for FC_FC_ABT_SEQ but I did not see it used anywhere.

So yeah it is on my todo to figure out what is up. I was hoping to ask you though :)


4. The -FC_EX_TIMEOUT handling in fc_fcp_error() doesn't make sense and I don't 
think force completion will ever occur in modified fc_fcp_error()
since from fc_exch.c the force completion will be issued -FC_EX_TIMEOUT
and that is ignored in fc_fcp_error().

I did not get that. If you are saying that fc_fcp_error should never get a -FC_EX_TIMEOUT because it never passes fc_exch.c a timeout then I agree. I will remove the code that checks for FC_EX_TIMEOUT in fc_fcp_error() since it is not needed.



5. The RRQ response drop exch ref for ACC but not if REJ.

The eh is completely broken in there. I have that on my todo. I was not sure what I am supposed to do. For a REJ can I just drop the ref and call exch_done or is there some condition in the target where I am supposed to retry?


I attached a patch which fixes the recov qual race, added back the ESB_ST_ABNORMAL check in fc_fcp_timeout (not having it can cause the resp handler for fcp commands to get called twice so maybe you hit that), remove the unneeded code in fc_fcp_error, and added the drop and done in the rrq response.

For #2, I was not sure how setting ESB_ST_COMPLETE did anything, so I did not add it back because it causes problems with the exch reset code. I think there must have been other code to cleanup and remove the ep in fc_exch_timeout when ESB_ST_ABNORMAL is found but I did not see it in some older releases either. I can add something if you want or let me know what I missed and I will refix this patch.

diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
index 207e849..e36d8f3 100644
--- a/drivers/scsi/libfc/fc_exch.c
+++ b/drivers/scsi/libfc/fc_exch.c
@@ -394,7 +394,7 @@ static void fc_exch_timeout(unsigned long ep_arg)
                spin_unlock_bh(&ep->ex_lock);
                if (e_stat & ESB_ST_REC_QUAL)
                        fc_exch_rrq(ep);
-       } else {
+       } else if (!(e_stat & ESB_ST_ABNORMAL)) {
                resp = ep->resp;
                arg = ep->resp_arg;
                /*
@@ -1272,12 +1272,16 @@ static void fc_exch_abts_resp(struct fc_exch *ep, 
struct fc_frame *fp)
        struct fc_seq *sp;
        u16 low;
        u16 high;
-       int rc = 1;
+       int rc = 1, has_rec = 0;
 
        fh = fc_frame_header_get(fp);
        if (fc_exch_debug)
                FC_DBG("exch: BLS rctl %x - %s\n",
                       fh->fh_r_ctl, fc_exch_rctl_name(fh->fh_r_ctl));
+
+       if (del_timer_sync(&ep->ex_timer))
+               fc_exch_release(ep);    /* release from pending timer hold */
+
        spin_lock_bh(&ep->ex_lock);
        switch (fh->fh_r_ctl) {
        case FC_RCTL_BA_ACC:
@@ -1297,7 +1301,7 @@ static void fc_exch_abts_resp(struct fc_exch *ep, struct 
fc_frame *fp)
                     ap->ba_seq_id == ep->seq_id) && low != high) {
                        ep->esb_stat |= ESB_ST_REC_QUAL;
                        fc_exch_hold(ep);  /* hold for recovery qualifier */
-                       fc_exch_timer_set_locked(ep, ep->r_a_tov);
+                       has_rec = 1;
                }
                break;
        case FC_RCTL_BA_RJT:
@@ -1325,6 +1329,8 @@ static void fc_exch_abts_resp(struct fc_exch *ep, struct 
fc_frame *fp)
 
        if (resp)
                resp(sp, fp, ex_resp_arg);
+       if (has_rec)
+               fc_exch_timer_set(ep, ep->r_a_tov);
 
        fc_frame_free(fp);
 }
@@ -1625,7 +1631,7 @@ static void fc_exch_rrq_resp(struct fc_seq *sp, struct 
fc_frame *fp, void *arg)
        switch (op) {
        case ELS_LS_RJT:
                FC_DBG("LS_RJT for RRQ");
-               break;
+               /* fall through */
        case ELS_LS_ACC:
                fc_exch_done(&aborted_ep->seq);
                fc_exch_release(aborted_ep); /* drop hold for rec qual */
diff --git a/drivers/scsi/libfc/fc_fcp.c b/drivers/scsi/libfc/fc_fcp.c
index a050dd4..f2915ed 100644
--- a/drivers/scsi/libfc/fc_fcp.c
+++ b/drivers/scsi/libfc/fc_fcp.c
@@ -1058,16 +1058,9 @@ static void fc_fcp_error(struct fc_fcp_pkt *fsp, struct 
fc_frame *fp)
        case -FC_EX_CLOSED:
                fc_fcp_retry_cmd(fsp);
                goto unlock;
-       case -FC_EX_TIMEOUT:
-               /*
-                * exch layer decided to abort exchange -
-                * will wait for response
-                */
-               fsp->state |= FC_SRB_ABORT_PENDING;
-               goto unlock;
+       default:
+               FC_DBG("unknown error %ld\n", PTR_ERR(fp));
        }
-
-       FC_DBG("unknown error %ld\n", PTR_ERR(fp));
        /*
         * clear abort pending, because the lower layer
         * decided to force completion.
_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel

Reply via email to