Re: [Open-FCoE] [PATCH 1/3] libfc: fix fc_eh_host_reset

Vasu Dev Wed, 12 Oct 2011 14:59:17 -0700

On Wed, 2011-10-12 at 01:43 -0700, Zou, Yi wrote:
> > On 8/9/2011 2:20 PM, Vasu Dev wrote:
> > > Current fc_eh_host_reset leaves the lport offline
> > > permanently because the LOGO response from the last
> > > reset gets handled as the FLOGI response, as both
> > > had the same exchange id.
> > >
> > > So fix this by doing end-to-end exchange clean-up,
> > > using exchange abort along with the exchange reset
> > > done from fc_eh_host_reset. This avoids
> > > exchange collisions between the sessions across
> > > the reset. In this case implicit login should have
> > > done that, but there is no abort support for FIP
> > > frames, so just wait for lport->r_a_tov before
> > > restarting the next FLOGI to ensure all exchanges
> > > are good to use again for the next session.
> > >
> > > Below is the trace of the LOGO response from the older
> > > session arriving ahead of the FLOGI response with the
> > > same exch id 0x203:
> > >
> > > 617  86.435165     4e.00.0b ->  ff.ff.fc     FC ELS LOGO 0x203
> > > 618  86.435195     4e.00.0b ->  b6.02.00     FC ELS LOGO 0x213
> > > 619  86.435220     4e.00.0b ->  18.03.00     FC ELS LOGO 0x223
> > > 620  86.435244     4e.00.0b ->  18.02.00     FC ELS LOGO 0x233
> > > 621  86.435267     4e.00.0b ->  18.01.00     FC ELS LOGO 0x243
> > > 622  86.435349     00.00.00 ->  ff.ff.fe     FC ELS FLOGI 0x203
> > > 623  86.435549     ff.ff.fc ->  4e.00.0b     FC ELS ACC (LOGO) 0x203
> > > 624  86.438721     ff.ff.fe ->  4e.00.0b     FC ELS ACC (FLOGI) 0x203
> > > 625  86.442059     18.03.00 ->  4e.00.0b     FC ELS ACC (LOGO) 0x223
> > > 626  86.443683     b6.02.00 ->  4e.00.0b     FC ELS ACC (LOGO) 0x213
> > > 627  86.447693     18.01.00 ->  4e.00.0b     FC ELS ACC (LOGO) 0x243
> > > 628  86.453499     18.02.00 ->  4e.00.0b     FC ELS ACC (LOGO) 0x233
> > >
> > > Signed-off-by: Vasu Dev<[email protected]>
> > > Tested-by: Ross Brattain<[email protected]>
> > > ---
> > I'm seeing a couple side effects with this change
> > 
> > 1. r_a_tov delay is present not only when fc_eh_host_reset is called,
> > but every time the lport is reset, as fc_lport_enter_reset() has the
> > delay. fc_lport_enter_reset is called from multiple contexts, and some
> > of them have rtnl_lock() held.  Because of this change we now have a
> > delay of 10 secs whenever an FCoE interface is created.
> > 
> > 2. When the FCoE interface is destroyed, we now send a LOGO followed
> > immediately by ABTS. Although there is no functional problem, it doesn't
> > look right from protocol point-of-view, as the exchange didn't really
> > timeout to issue an ABTS.
> > 
> > Can you please elaborate in what circumstances do we see this problem
> > (apart from what is described)? Is the delay of 10 secs required in all
> > the scenarios? I'm afraid it may lead to system slowdown as we sleep
> > with multiple locks held. One such issue is observed when creating
> > multiple NPIV ports.  We also see the system temporarily hangs for 10
> > secs during the msleep() when the FCoE interface is created.
> > 
> > If this is meant only for fc_eh_host_reset(), can we somehow localize
> > this fix only to that context?


It can happen on any login issued right after a reset, as the trace
attached to the patch shows, so it is not limited to fc_eh_host_reset().
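
To make the collision in the trace concrete, here is a minimal toy model
in Python. It is not libfc code: ExchangeManager and its methods are
hypothetical names, and matching responses by exchange id alone is a
simplifying assumption. It replays frames 617 and 622-624 for xid 0x203.

```python
# Toy model of the exchange-id collision from the trace (not libfc code).
# ExchangeManager, alloc, reset and deliver are hypothetical names.

class ExchangeManager:
    def __init__(self):
        self.active = {}  # xid -> name of the request that owns it

    def alloc(self, xid, owner):
        if xid in self.active:
            raise ValueError("xid 0x%x still in use" % xid)
        self.active[xid] = owner

    def reset(self):
        # Local-only clean-up: the peer may still answer on these xids.
        self.active.clear()

    def deliver(self, xid, response):
        # Assumption: a response is matched to a request by xid alone.
        owner = self.active.pop(xid, None)
        return (owner, response)


def replay_trace():
    em = ExchangeManager()
    em.alloc(0x203, "LOGO")                     # frame 617
    em.reset()                                  # reset frees 0x203 locally
    em.alloc(0x203, "FLOGI")                    # frame 622 reuses the xid
    first = em.deliver(0x203, "ACC (LOGO)")     # frame 623 arrives first
    second = em.deliver(0x203, "ACC (FLOGI)")   # frame 624 finds no exchange
    return first, second
```

In this model the stale LOGO ACC is consumed by the FLOGI exchange and
the real FLOGI ACC has nothing left to match, which is the failure the
patch description reports.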

> > 
> > Thanks,
> > Bhanu
> 
> I think the problem the patch is trying to fix is already illustrated
> by the included trace. Basically it's a conflict of exch id, because the
> initiator side thinks the exch resources are free since
> fc_eh_host_reset() results in an exch_mgr_reset. So I think any path that
> ends in fc_lport_reset() is likely to have this issue: fc_eh_host_reset(),
> or the other one I see, disabling the fc_lport.

Yes, the fc_lport_reset() code path has exch reuse/conflict issues here,
and this is complicated by FIP, which has no abort concept.
However, I also see a problem with the added msleep(), as Bhanu pointed
out, especially sleeping with locks held. The delay during reset should be
tolerable since resets are unlikely events, but it is good to avoid if
possible, so let me look into a different fix here, or at least delay only
if the LOGO has not completed, and without msleep().


> 
> I can see the side effects you described here, particularly that it is
> not nice to msleep() w/ multiple locks held. Since currently there is
> no exch timeout value (not sure why?) for LOGO on rports, i.e., the LOGOs
> shown in this trace, and as Vasu mentioned no ABTS on FIP, sending an abort
> along the exch reset path is a solution to guarantee the exchanges on both
> ends are in a synced-up state.
> 
> Bottom line is that we have to know the exch is reusable on the other side
> before reissuing the FLOGI. I guess we really need a timeout on rport LOGOs,
> not just the fabric one, and we have to wait for completion of the LOGO
> ACC/RJT or the timeout before continuing.

Yeah, that would work, and a timeout mechanism other than msleep() is
needed here only if the LOGO has not completed.
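
As a rough userspace illustration of "wait only if not completed", the
Python sketch below tracks outstanding LOGOs and waits, bounded by a
timeout, only while some are still pending. LogoTracker and its methods
are hypothetical names, not libfc structures, and a kernel version would
have to avoid blocking with locks held, which this sketch does not model.

```python
import threading


class LogoTracker:
    """Counts outstanding LOGOs (hypothetical helper, not libfc code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = 0
        self._done = threading.Event()
        self._done.set()  # nothing outstanding yet

    def sent(self):
        # Call when a LOGO is sent.
        with self._lock:
            self._pending += 1
            self._done.clear()

    def completed(self):
        # Call when a LOGO ACC/RJT arrives.
        with self._lock:
            self._pending -= 1
            if self._pending == 0:
                self._done.set()

    def wait_before_flogi(self, timeout):
        # Returns True at once if all LOGOs completed; otherwise waits
        # up to `timeout` seconds and returns False on expiry.
        return self._done.wait(timeout)
```

With no LOGO outstanding, wait_before_flogi() returns without any delay,
so the r_a_tov-style wait is paid only when a LOGO response is missing.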

Thanks
Vasu



_______________________________________________
devel mailing list
[email protected]
https://lists.open-fcoe.org/mailman/listinfo/devel
