Re: [Open-FCoE] some fc exch reset fixes and a fix to my abort patches

Mike Christie Mon, 18 Aug 2008 14:35:31 -0700

Mike Christie wrote:

Dev, Vasu wrote:
-----Original Message-----
From: [EMAIL PROTECTED][mailto:[EMAIL PROTECTED] On
Behalf Of [EMAIL PROTECTED]
Sent: Thursday, August 14, 2008 8:40 PM
To: [email protected]
Subject: [Open-FCoE] some fc exch reset fixes and a fix to my abortpatches
These patches were made over my last abort patches and over the
current tree.

They fix a couple bugs in fc_exch.c and fc_fcp.c. Most are old bugs,
but one was caused by my abort changes. The fix for the bug I added
is also needed for the old bugs I am fixing, so I am including it inhere
so I do not fill up the list by resending the old patchset. If you want
me to resend the old patches with this fix merged in it let me know
and I will do that instead.
This is fine and no need to resend patches in one merged set, I was able
to apply these patches after resolving very minor conflict due to recent
renaming of fc_fcp_recv().

I know I'm not responding to correct email/s thread for individual more
points below in this email but given so many of your floating patches,
I guess aggregate reply to all patches will expedite your patches
for inclusion to tree, therefore I'm writing all my points in thissingle response.
I took your recent 10 patches series and followed by 4 more patches up to
patch title as "libfc: check ESB_ST_COMPLETE before completing ep(take 2)"
and then applied these 14 patches to commit.
57929ea0ac20b23f894a557f2b0842dc1ec7d804 Robert Love 5 days agotitle:fcoe: Move protocol definitions fc_fcoe.h
These 14 patches works fine for basic testing :-) however not with
kicking error handling paths, more details below.

I tried a new error test to kick error handling paths with and without
your these 14 patches by dropping SCSI response to a SCSI read command.

With your 14 patches. My system locks up persistently after
single error test run, I mean system becomes non response without any
additional error messages though my kernel has spin lock debuggingenabled.
So I tried this error test without your 14 patches, this time I ran
into exch cleanup issue which is likely due to messed up exch ref
in RRQ which you have fixed in patch 7/10. However this time multiple
error test runs works fine without causing any system lock up.
However wireshark trace looks good with or without your patchesthrough REC,
REC RJT, ABTS, BA_ACC, RRQ and RRQ_ACC with my target when error test is
done.

So any idea why system locks up with your 14 patches?
Weird with the 14 patches it works almost perfect for me, but now itdoes not work for you guys :)
No idea why it would work without my patches, but there is a possiblelock up when the fc_eh_host_reset functions runs. I am getting bh stateerrors when that is run and a lock up sometimes. Did you see theinitiator go through the reset process? I see this with or without mypatches though.
I have been running it hard and it has survived, but like I said in somemail I have not been able to run the abts response code (I have beentrying to simulate it in this and other testing but I am thinking thatis not doing the right thing). abts's ba_acc code path is untested in areal setup, because the software targets just seem to drop abts.
Do we get past the abort code and get to the fc_eh_host_reset code?Could you comment out the setting of the fc_eh_abort andfc_eh_device_reset in the scsi_host_template so we can narrow it down.
I think some more issues from these 14 patches or I failed to
understand your all changes in these 14 patches. So please help
me understand mainly changes on fc_exch.c and fcp abort handling.

1. Earlier fc_exch_abts_resp() would have completed exch for all upper
layers, that means fc_exch_timeout() would have send RRQ for sure under
ESB_ST_COMPLETE bit check.

        if (e_stat & ESB_ST_COMPLETE) {
ep->esb_stat = e_stat & ~(ESB_ST_REC_QUAL |ESB_ST_COMPLETE);
              spin_unlock_bh(&ep->ex_lock);
              if (e_stat & ESB_ST_REC_QUAL)
                        fc_exch_rrq(ep);
          }

Now for fcp, exchange may not be completed and therefore check on
Do you mean fc_fcp may not call exch_done or that the fc_fcp.c exch_donecall can race with the recov qual timer?
For the exch_done getting called I think it should alawys get called ifwe get a ba_accc. See below.
If you are saying that with my patch because fc_exch_abts_resp sets theESB_ST_REC_QUAL bit and then sets the timer all under ex_lock, thendrops the the lock to call the resp function which for fc_fcp.c wouldcall exch_done, there is a race between the timer when we drop the lockand when fcp calls exch_done I agree on that.
For the abort resp case should I make it special, and for that fc_fcp.cshould not call exch_done on the ep/sp, or should I just move thesetting the of the timer to bottom of fc_exch_abts_resp and modify theother code to handle that. I think the first is a lot easier. The secondone though allows us to have the same behavior for fc_fcp.c in all caseswhich is nice.
ESB_ST_COMPLETE bit check will prevent sending RRQ in that case and
exchange will not get freed since fc_exch_done() will abandon exch
free due to ESB_ST_REC_QUAL  bit set in exch_done() code path.
For the first part of the question you are referring to this in thepatches:
        if (fh->fh_type != FC_TYPE_FCP &&
            ntoh24(fh->fh_f_ctl) & FC_FC_LAST_SEQ)
                rc = fc_exch_done_locked(ep);

vs this in the old code:
        if (ntoh24(fh->fh_f_ctl) & FC_FC_LAST_SEQ)
                fc_exch_complete_locked(ep);
I tried to follow the behavior in fc_exch_recv_seq_resp, where for nonfcp commands fc_exch.c would do the complete/exch_done call, and for fcpcommands we allow fc_fcp.c to call complete/exch_done.
So for something like the els rec fc_exch.c calls exch_done for thecaller in the patch, because fc_fcp.c does not. But for a scsi commandswe would not execute that code, and fc_fcp's resp handler calls theexch_done if we got a ba_acc (fc_fcp_recv->fc_fcp_abts_resp). If we gota reject then fc_fcp_abts_resp lets the command timeout and the lu resetcode handle the cleanup, and if that fails then the mgr reset wouldclean it up. Thinking about that latter behavior again, I think thatmight be dumb and there is no need to let the abort or command timeoutto escalate, and we can just cleanup the command in fc_fcp_abts_resp nomatter what we got.
So if we got a ba_acc then fc_fcp.c should call exch_done, which setsthe ESB_ST_COMPLETE bit. If we got a FC_RCTL_BA_RJT then we would nothave a recovery qualifier so that path should be ok refcount wise andwith the scsi-ml escalation we would eventually cleanup the fsp and ep.
Does that work or did I miss something?
I also don't understand resetting ESB_ST_COMPLETE bit infc_exch_timeout().
In the bottom part of that patch
http://kernel.org/pub/linux/kernel/people/mnc/fcoe/16-08-08/0014-libfc-check-ESB_ST_COMPLETE-before-completing-ep-t.patchI added the check for if two threads got holds on the ep, and both triedto call exch_done then only one would succeed. I do this by checking forESB_ST_COMPLETE.
So the initial exch_done in fc_exch_abts_resp would hit theESB_ST_REC_QUAL check and would not drop the hold from the alloc. Butwhen fc_exch_rrq_resp runs when we call exch_done we want the drop to bedone so I cleared it. Or if mgr reset runs we wanted the exch_done do bedone.

This is not right for the mgr reset path. If the timer function wasrunning and fc_exch_reset had grabbed the ex_lock before thefc_exch_timeout, then when fc_exch_reset or the resp handler tries tocall exch_done, exch_done may see that is ESB_ST_COMPLETE is set andwill not complete the ep. The timer would then grab the ex_lock and seeESB_ST_COMPLETE and by this time ESB_ST_REC_QUAL is cleared so the timerwould just exit. However, the ep would stay on the list so whenfc_exch_mgr_reset sees it again we will call the resp handler again andfc_fcp.c would have already freed the fsp.

So I think we need something like the attached patch. With that patchand the handle-vasu-comments.patch patch patch, fc_exch_reset wouldforce the cleanup of the ep, then when fc_exch_timeout got the ex_lockit would see the ESB_ST_ABNORMAL and just exit.

I guess we could also try to add another esb_stat bit instead of tryingto reuse ESB_ST_COMPLETE. I will look into that for another patch if youwant.

diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
index a4d3901..65dcaa2 100644
--- a/drivers/scsi/libfc/fc_exch.c
+++ b/drivers/scsi/libfc/fc_exch.c
@@ -1452,7 +1452,7 @@ static void fc_exch_reset(struct fc_exch *ep)
        ep->resp = NULL;
        if (ep->esb_stat & ESB_ST_REC_QUAL)
                atomic_dec(&ep->ex_refcnt);     /* drop hold for rec_qual */
-       ep->esb_stat &= ~ESB_ST_REC_QUAL;
+       ep->esb_stat &= ~(ESB_ST_REC_QUAL | ESB_ST_COMPLETE);
        arg = ep->resp_arg;
        if (del_timer(&ep->ex_timer))
                atomic_dec(&ep->ex_refcnt);     /* drop hold for timer */

_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel

Re: [Open-FCoE] some fc exch reset fixes and a fix to my abort patches

Reply via email to