Mike Christie wrote:
Dev, Vasu wrote:
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On
Behalf Of [EMAIL PROTECTED]
Sent: Thursday, August 14, 2008 8:40 PM
To: [email protected]
Subject: [Open-FCoE] some fc exch reset fixes and a fix to my abort
patches
These patches were made over my last abort patches and over the
current tree.
They fix a couple bugs in fc_exch.c and fc_fcp.c. Most are old bugs,
but one was caused by my abort changes. The fix for the bug I added
is also needed for the old bugs I am fixing, so I am including it in
here
so I do not fill up the list by resending the old patchset. If you want
me to resend the old patches with this fix merged in it let me know
and I will do that instead.
This is fine and no need to resend patches in one merged set, I was able
to apply these patches after resolving very minor conflict due to recent
renaming of fc_fcp_recv().
I know I'm not responding to correct email/s thread for individual more
points below in this email but given so many of your floating patches,
I guess aggregate reply to all patches will expedite your patches
for inclusion to tree, therefore I'm writing all my points in this
single response.
I took your recent 10 patches series and followed by 4 more patches up to
patch title as "libfc: check ESB_ST_COMPLETE before completing ep
(take 2)"
and then applied these 14 patches to commit.
57929ea0ac20b23f894a557f2b0842dc1ec7d804 Robert Love 5 days ago
title:fcoe: Move protocol definitions fc_fcoe.h
These 14 patches works fine for basic testing :-) however not with
kicking error handling paths, more details below.
I tried a new error test to kick error handling paths with and without
your these 14 patches by dropping SCSI response to a SCSI read command.
With your 14 patches. My system locks up persistently after
single error test run, I mean system becomes non response without any
additional error messages though my kernel has spin lock debugging
enabled.
So I tried this error test without your 14 patches, this time I ran
into exch cleanup issue which is likely due to messed up exch ref
in RRQ which you have fixed in patch 7/10. However this time multiple
error test runs works fine without causing any system lock up.
However wireshark trace looks good with or without your patches
through REC,
REC RJT, ABTS, BA_ACC, RRQ and RRQ_ACC with my target when error test is
done.
So any idea why system locks up with your 14 patches?
Weird with the 14 patches it works almost perfect for me, but now it
does not work for you guys :)
No idea why it would work without my patches, but there is a possible
lock up when the fc_eh_host_reset functions runs. I am getting bh state
errors when that is run and a lock up sometimes. Did you see the
initiator go through the reset process? I see this with or without my
patches though.
I have been running it hard and it has survived, but like I said in some
mail I have not been able to run the abts response code (I have been
trying to simulate it in this and other testing but I am thinking that
is not doing the right thing). abts's ba_acc code path is untested in a
real setup, because the software targets just seem to drop abts.
Do we get past the abort code and get to the fc_eh_host_reset code?
Could you comment out the setting of the fc_eh_abort and
fc_eh_device_reset in the scsi_host_template so we can narrow it down.
I think some more issues from these 14 patches or I failed to
understand your all changes in these 14 patches. So please help
me understand mainly changes on fc_exch.c and fcp abort handling.
1. Earlier fc_exch_abts_resp() would have completed exch for all upper
layers, that means fc_exch_timeout() would have send RRQ for sure under
ESB_ST_COMPLETE bit check.
if (e_stat & ESB_ST_COMPLETE) {
ep->esb_stat = e_stat & ~(ESB_ST_REC_QUAL |
ESB_ST_COMPLETE);
spin_unlock_bh(&ep->ex_lock);
if (e_stat & ESB_ST_REC_QUAL)
fc_exch_rrq(ep);
}
Now for fcp, exchange may not be completed and therefore check on
Do you mean fc_fcp may not call exch_done or that the fc_fcp.c exch_done
call can race with the recov qual timer?
For the exch_done getting called I think it should alawys get called if
we get a ba_accc. See below.
If you are saying that with my patch because fc_exch_abts_resp sets the
ESB_ST_REC_QUAL bit and then sets the timer all under ex_lock, then
drops the the lock to call the resp function which for fc_fcp.c would
call exch_done, there is a race between the timer when we drop the lock
and when fcp calls exch_done I agree on that.
For the abort resp case should I make it special, and for that fc_fcp.c
should not call exch_done on the ep/sp, or should I just move the
setting the of the timer to bottom of fc_exch_abts_resp and modify the
other code to handle that. I think the first is a lot easier. The second
one though allows us to have the same behavior for fc_fcp.c in all cases
which is nice.
ESB_ST_COMPLETE bit check will prevent sending RRQ in that case and
exchange will not get freed since fc_exch_done() will abandon exch
free due to ESB_ST_REC_QUAL bit set in exch_done() code path.
For the first part of the question you are referring to this in the
patches:
if (fh->fh_type != FC_TYPE_FCP &&
ntoh24(fh->fh_f_ctl) & FC_FC_LAST_SEQ)
rc = fc_exch_done_locked(ep);
vs this in the old code:
if (ntoh24(fh->fh_f_ctl) & FC_FC_LAST_SEQ)
fc_exch_complete_locked(ep);
I tried to follow the behavior in fc_exch_recv_seq_resp, where for non
fcp commands fc_exch.c would do the complete/exch_done call, and for fcp
commands we allow fc_fcp.c to call complete/exch_done.
So for something like the els rec fc_exch.c calls exch_done for the
caller in the patch, because fc_fcp.c does not. But for a scsi commands
we would not execute that code, and fc_fcp's resp handler calls the
exch_done if we got a ba_acc (fc_fcp_recv->fc_fcp_abts_resp). If we got
a reject then fc_fcp_abts_resp lets the command timeout and the lu reset
code handle the cleanup, and if that fails then the mgr reset would
clean it up. Thinking about that latter behavior again, I think that
might be dumb and there is no need to let the abort or command timeout
to escalate, and we can just cleanup the command in fc_fcp_abts_resp no
matter what we got.
So if we got a ba_acc then fc_fcp.c should call exch_done, which sets
the ESB_ST_COMPLETE bit. If we got a FC_RCTL_BA_RJT then we would not
have a recovery qualifier so that path should be ok refcount wise and
with the scsi-ml escalation we would eventually cleanup the fsp and ep.
Does that work or did I miss something?
I also don't understand resetting ESB_ST_COMPLETE bit in
fc_exch_timeout().
In the bottom part of that patch
http://kernel.org/pub/linux/kernel/people/mnc/fcoe/16-08-08/0014-libfc-check-ESB_ST_COMPLETE-before-completing-ep-t.patch
I added the check for if two threads got holds on the ep, and both tried
to call exch_done then only one would succeed. I do this by checking for
ESB_ST_COMPLETE.
So the initial exch_done in fc_exch_abts_resp would hit the
ESB_ST_REC_QUAL check and would not drop the hold from the alloc. But
when fc_exch_rrq_resp runs when we call exch_done we want the drop to be
done so I cleared it. Or if mgr reset runs we wanted the exch_done do be
done.
This is not right for the mgr reset path. If the timer function was
running and fc_exch_reset had grabbed the ex_lock before the
fc_exch_timeout, then when fc_exch_reset or the resp handler tries to
call exch_done, exch_done may see that is ESB_ST_COMPLETE is set and
will not complete the ep. The timer would then grab the ex_lock and see
ESB_ST_COMPLETE and by this time ESB_ST_REC_QUAL is cleared so the timer
would just exit. However, the ep would stay on the list so when
fc_exch_mgr_reset sees it again we will call the resp handler again and
fc_fcp.c would have already freed the fsp.
So I think we need something like the attached patch. With that patch
and the handle-vasu-comments.patch patch patch, fc_exch_reset would
force the cleanup of the ep, then when fc_exch_timeout got the ex_lock
it would see the ESB_ST_ABNORMAL and just exit.
I guess we could also try to add another esb_stat bit instead of trying
to reuse ESB_ST_COMPLETE. I will look into that for another patch if you
want.
diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
index a4d3901..65dcaa2 100644
--- a/drivers/scsi/libfc/fc_exch.c
+++ b/drivers/scsi/libfc/fc_exch.c
@@ -1452,7 +1452,7 @@ static void fc_exch_reset(struct fc_exch *ep)
ep->resp = NULL;
if (ep->esb_stat & ESB_ST_REC_QUAL)
atomic_dec(&ep->ex_refcnt); /* drop hold for rec_qual */
- ep->esb_stat &= ~ESB_ST_REC_QUAL;
+ ep->esb_stat &= ~(ESB_ST_REC_QUAL | ESB_ST_COMPLETE);
arg = ep->resp_arg;
if (del_timer(&ep->ex_timer))
atomic_dec(&ep->ex_refcnt); /* drop hold for timer */
_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel