Re: [Open-FCoE] some fc exch reset fixes and a fix to my abort patches

Mike Christie Mon, 18 Aug 2008 10:54:56 -0700

Dev, Vasu wrote:

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of [EMAIL PROTECTED]
Sent: Thursday, August 14, 2008 8:40 PM
To: [email protected]
Subject: [Open-FCoE] some fc exch reset fixes and a fix to my abort patches


These patches were made over my last abort patches and over the
current tree.

They fix a couple bugs in fc_exch.c and fc_fcp.c. Most are old bugs,
but one was caused by my abort changes. The fix for the bug I added
is also needed for the old bugs I am fixing, so I am including it in here
so I do not fill up the list by resending the old patchset. If you want
me to resend the old patches with this fix merged in it let me know
and I will do that instead.


This is fine and no need to resend patches in one merged set, I was able
to apply these patches after resolving very minor conflict due to recent
renaming of fc_fcp_recv().

I know I'm not responding to correct email/s thread for individual more
points below in this email but given so many of your floating patches,
I guess aggregate reply to all patches will expedite your patches
for inclusion to tree, therefore I'm writing all my points in this single 
response.

I took your recent 10 patches series and followed by 4 more patches up to
patch title as  "libfc: check ESB_ST_COMPLETE before completing ep (take 2)"
and then applied these 14 patches to commit.

57929ea0ac20b23f894a557f2b0842dc1ec7d804 Robert Love 5 days ago title:fcoe: 
Move protocol definitions fc_fcoe.h

These 14 patches works fine for basic testing :-) however not with
kicking error handling paths, more details below.

I tried a new error test to kick error handling paths with and without
your these 14 patches by dropping SCSI response to a SCSI read command.

With your 14 patches. My system locks up persistently after
single error test run, I mean system becomes non response without any
additional error messages though my kernel has spin lock debugging enabled.

So I tried this error test without your 14 patches, this time I ran
into exch cleanup issue which is likely due to messed up exch ref
in RRQ which you have fixed in patch 7/10. However this time multiple
error test runs works fine without causing any system lock up.

However wireshark trace looks good with or without your patches through REC,
REC RJT, ABTS, BA_ACC, RRQ and RRQ_ACC with my target when error test is
done.

So any idea why system locks up with your 14 patches?

Weird with the 14 patches it works almost perfect for me, but now itdoes not work for you guys :)

No idea why it would work without my patches, but there is a possiblelock up when the fc_eh_host_reset functions runs. I am getting bh stateerrors when that is run and a lock up sometimes. Did you see theinitiator go through the reset process? I see this with or without mypatches though.

I have been running it hard and it has survived, but like I said in somemail I have not been able to run the abts response code (I have beentrying to simulate it in this and other testing but I am thinking thatis not doing the right thing). abts's ba_acc code path is untested in areal setup, because the software targets just seem to drop abts.

Do we get past the abort code and get to the fc_eh_host_reset code?Could you comment out the setting of the fc_eh_abort andfc_eh_device_reset in the scsi_host_template so we can narrow it down.

I think some more issues from these 14 patches or I failed to
understand your all changes in these 14 patches. So please help
me understand mainly changes on fc_exch.c and fcp abort handling.

1. Earlier fc_exch_abts_resp() would have completed exch for all upper
layers, that means fc_exch_timeout() would have send RRQ for sure under
ESB_ST_COMPLETE bit check.

        if (e_stat & ESB_ST_COMPLETE) {
              ep->esb_stat = e_stat & ~(ESB_ST_REC_QUAL | ESB_ST_COMPLETE);
              spin_unlock_bh(&ep->ex_lock);
              if (e_stat & ESB_ST_REC_QUAL)
                        fc_exch_rrq(ep);
          }

Now for fcp, exchange may not be completed and therefore check on

Do you mean fc_fcp may not call exch_done or that the fc_fcp.c exch_donecall can race with the recov qual timer?

For the exch_done getting called I think it should alawys get called ifwe get a ba_accc. See below.

If you are saying that with my patch because fc_exch_abts_resp sets theESB_ST_REC_QUAL bit and then sets the timer all under ex_lock, thendrops the the lock to call the resp function which for fc_fcp.c wouldcall exch_done, there is a race between the timer when we drop the lockand when fcp calls exch_done I agree on that.

For the abort resp case should I make it special, and for that fc_fcp.cshould not call exch_done on the ep/sp, or should I just move thesetting the of the timer to bottom of fc_exch_abts_resp and modify theother code to handle that. I think the first is a lot easier. The secondone though allows us to have the same behavior for fc_fcp.c in all caseswhich is nice.

ESB_ST_COMPLETE bit check will prevent sending RRQ in that case and
exchange will not get freed since fc_exch_done() will abandon exch
free due to ESB_ST_REC_QUAL  bit set in exch_done() code path.


For the first part of the question you are referring to this in the patches:

        if (fh->fh_type != FC_TYPE_FCP &&
            ntoh24(fh->fh_f_ctl) & FC_FC_LAST_SEQ)
                rc = fc_exch_done_locked(ep);

vs this in the old code:
        if (ntoh24(fh->fh_f_ctl) & FC_FC_LAST_SEQ)
                fc_exch_complete_locked(ep);

I tried to follow the behavior in fc_exch_recv_seq_resp, where for nonfcp commands fc_exch.c would do the complete/exch_done call, and for fcpcommands we allow fc_fcp.c to call complete/exch_done.

So for something like the els rec fc_exch.c calls exch_done for thecaller in the patch, because fc_fcp.c does not. But for a scsi commandswe would not execute that code, and fc_fcp's resp handler calls theexch_done if we got a ba_acc (fc_fcp_recv->fc_fcp_abts_resp). If we gota reject then fc_fcp_abts_resp lets the command timeout and the lu resetcode handle the cleanup, and if that fails then the mgr reset wouldclean it up. Thinking about that latter behavior again, I think thatmight be dumb and there is no need to let the abort or command timeoutto escalate, and we can just cleanup the command in fc_fcp_abts_resp nomatter what we got.

So if we got a ba_acc then fc_fcp.c should call exch_done, which setsthe ESB_ST_COMPLETE bit. If we got a FC_RCTL_BA_RJT then we would nothave a recovery qualifier so that path should be ok refcount wise andwith the scsi-ml escalation we would eventually cleanup the fsp and ep.


Does that work or did I miss something?


I also don't understand resetting ESB_ST_COMPLETE bit in fc_exch_timeout().



In the bottom part of that patch
http://kernel.org/pub/linux/kernel/people/mnc/fcoe/16-08-08/0014-libfc-check-ESB_ST_COMPLETE-before-completing-ep-t.patch

I added the check for if two threads got holds on the ep, and both triedto call exch_done then only one would succeed. I do this by checking forESB_ST_COMPLETE.

So the initial exch_done in fc_exch_abts_resp would hit theESB_ST_REC_QUAL check and would not drop the hold from the alloc. Butwhen fc_exch_rrq_resp runs when we call exch_done we want the drop to bedone so I cleared it. Or if mgr reset runs we wanted the exch_done do bedone.


2. Now abort handling is broken for non fc_fcp.c if no response to abort
after removed ESB_ST_ABNORMAL state handling in fc_exch_timeout. Earlier
abort was contained within fc_exch.c for all upper layers of fc_exch.c and
aborted exchange was done either on BA_ACC or ESB_ST_ABNORMAL handling in
fc_exch_timeout.



For ESB_ST_ABNORMAL we just did this:

        } else if (e_stat & ESB_ST_ABNORMAL) {
                ep->esb_stat |= ESB_ST_COMPLETE;
                spin_unlock_bh(&ep->ex_lock);

The ep is never cleaned up for non fcp commands here.

The problem with just setting the ESB_ST_COMPLETE bit and the otherpatches in the patchset is that when we try to reset the port we mustnotify the upper layers at that time that the command is gettingkilled., but the old exch reset code check if ESB_ST_COMPLETE was setand if it was it would not call the resp function.




In the patch
http://kernel.org/pub/linux/kernel/people/mnc/fcoe/16-08-08/0013-libfc-don-t-set-ESB_ST_COMPLETE-in-fc_exch_timeout.patch

I asked if we should be escalating the eh here. Should we be retryingthe abort here or escalating and doing a reset or something? If we aredoing something today, I did not see it.

For fcp commands, instead of waiting for a lun reset or port reset tocome arount we could fail the ep with some new FC_EX value to indicatethat it timedout at the exch layer.


3. The fc_fcp_recv() is getting called for abort issued from initiator but
not called for incoming abort though this is unlikely in initiator but yet
possible. I don't understand spec well on this case but I guess should
be same as sending abort and then drop all frames while abort finishes. You 
recently checked spec on this, so did you left not calling fcp in this
case on purpose as per spec or it is in your TODO ?

I left it on purpose because I did not understand the spec and wanted tocheck with you.

I did not see where in the spec it said we could get an abts for fcpcommands in fcp3/4.

For fc-fs, I could not figure out where for class 3 we could get get anabts from the target for fcp command too. I saw some stuff llike infc-fs 20.5.6.3 and 16.4.10.1 where for stop sequence it could set theabort sequence bits on acks to data, but I thought we would not begettng/sending acks in class 3?

If we can get it in other sequences I thought we wanted to check forFC_FC_ABT_SEQ but I did not see it used anywhere.

So yeah it is on my todo to figure out what is up. I was hoping to askyou though :)


4. The -FC_EX_TIMEOUT handling in fc_fcp_error() doesn't make sense and I don't 
think force completion will ever occur in modified fc_fcp_error()
since from fc_exch.c the force completion will be issued -FC_EX_TIMEOUT
and that is ignored in fc_fcp_error().

I did not get that. If you are saying that fc_fcp_error should never geta -FC_EX_TIMEOUT because it never passes fc_exch.c a timeout then Iagree. I will remove the code that checks for FC_EX_TIMEOUT infc_fcp_error() since it is not needed.


5. The RRQ response drop exch ref for ACC but not if REJ.

The eh is completely broken in there. I have that on my todo. I was notsure what I am supposed to do. For a REJ can I just drop the ref andcall exch_done or is there some condition in the target where I amsupposed to retry?

I attached a patch which fixes the recov qual race, added back theESB_ST_ABNORMAL check in fc_fcp_timeout (not having it can cause theresp handler for fcp commands to get called twice so maybe you hitthat), remove the unneeded code in fc_fcp_error, and added the drop anddone in the rrq response.

For #2, I was not sure how setting ESB_ST_COMPLETE did anything, so Idid not add it back because it causes problems with the exch reset code.I think there must have been other code to cleanup and remove the ep infc_exch_timeout when ESB_ST_ABNORMAL is found but I did not see it insome older releases either. I can add something if you want or let meknow what I missed and I will refix this patch.

diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
index 207e849..e36d8f3 100644
--- a/drivers/scsi/libfc/fc_exch.c
+++ b/drivers/scsi/libfc/fc_exch.c
@@ -394,7 +394,7 @@ static void fc_exch_timeout(unsigned long ep_arg)
                spin_unlock_bh(&ep->ex_lock);
                if (e_stat & ESB_ST_REC_QUAL)
                        fc_exch_rrq(ep);
-       } else {
+       } else if (!(e_stat & ESB_ST_ABNORMAL)) {
                resp = ep->resp;
                arg = ep->resp_arg;
                /*
@@ -1272,12 +1272,16 @@ static void fc_exch_abts_resp(struct fc_exch *ep, 
struct fc_frame *fp)
        struct fc_seq *sp;
        u16 low;
        u16 high;
-       int rc = 1;
+       int rc = 1, has_rec = 0;
 
        fh = fc_frame_header_get(fp);
        if (fc_exch_debug)
                FC_DBG("exch: BLS rctl %x - %s\n",
                       fh->fh_r_ctl, fc_exch_rctl_name(fh->fh_r_ctl));
+
+       if (del_timer_sync(&ep->ex_timer))
+               fc_exch_release(ep);    /* release from pending timer hold */
+
        spin_lock_bh(&ep->ex_lock);
        switch (fh->fh_r_ctl) {
        case FC_RCTL_BA_ACC:
@@ -1297,7 +1301,7 @@ static void fc_exch_abts_resp(struct fc_exch *ep, struct 
fc_frame *fp)
                     ap->ba_seq_id == ep->seq_id) && low != high) {
                        ep->esb_stat |= ESB_ST_REC_QUAL;
                        fc_exch_hold(ep);  /* hold for recovery qualifier */
-                       fc_exch_timer_set_locked(ep, ep->r_a_tov);
+                       has_rec = 1;
                }
                break;
        case FC_RCTL_BA_RJT:
@@ -1325,6 +1329,8 @@ static void fc_exch_abts_resp(struct fc_exch *ep, struct 
fc_frame *fp)
 
        if (resp)
                resp(sp, fp, ex_resp_arg);
+       if (has_rec)
+               fc_exch_timer_set(ep, ep->r_a_tov);
 
        fc_frame_free(fp);
 }
@@ -1625,7 +1631,7 @@ static void fc_exch_rrq_resp(struct fc_seq *sp, struct 
fc_frame *fp, void *arg)
        switch (op) {
        case ELS_LS_RJT:
                FC_DBG("LS_RJT for RRQ");
-               break;
+               /* fall through */
        case ELS_LS_ACC:
                fc_exch_done(&aborted_ep->seq);
                fc_exch_release(aborted_ep); /* drop hold for rec qual */
diff --git a/drivers/scsi/libfc/fc_fcp.c b/drivers/scsi/libfc/fc_fcp.c
index a050dd4..f2915ed 100644
--- a/drivers/scsi/libfc/fc_fcp.c
+++ b/drivers/scsi/libfc/fc_fcp.c
@@ -1058,16 +1058,9 @@ static void fc_fcp_error(struct fc_fcp_pkt *fsp, struct 
fc_frame *fp)
        case -FC_EX_CLOSED:
                fc_fcp_retry_cmd(fsp);
                goto unlock;
-       case -FC_EX_TIMEOUT:
-               /*
-                * exch layer decided to abort exchange -
-                * will wait for response
-                */
-               fsp->state |= FC_SRB_ABORT_PENDING;
-               goto unlock;
+       default:
+               FC_DBG("unknown error %ld\n", PTR_ERR(fp));
        }
-
-       FC_DBG("unknown error %ld\n", PTR_ERR(fp));
        /*
         * clear abort pending, because the lower layer
         * decided to force completion.

_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel

Re: [Open-FCoE] some fc exch reset fixes and a fix to my abort patches

Reply via email to