On 9/16/10 6:50 PM, Robert Love wrote:
> On Tue, 2010-09-07 at 18:58 +0000, Bhanu Gollapudi wrote:
>> On Fri, 2010-09-03 at 11:23 -0700, Robert Love wrote:
>>
>>>>
>>>> Absolutely, this is the real root cause, and it needs to be fixed.
>>>> However, running out of resources can lead to this hang quickly, and
>>>> even if we fix it, discovery takes a long time to complete after many
>>>> retries, as we use only a subset of the xid resources. Also, I was not
>>>> very happy to see the per-CPU xid resources shrink as the number of CPUs
>>>> in the system grows, and hence submitted this change. Is there any downside of having
>>>> the common pool?
>>>>
>
> Is the main reason that the CPU's xid space is exhausted that the same
> context that receives the link up event from netdev sends all of the
> FLOGIs? Maybe we should defer the re-login so that the scheduler can do
> a better job of distributing the FLOGI xid allocations across all
> available CPUs.
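
Something along these lines, purely as a sketch (fcoe_defer_relogin() and
fcoe_relogin_work() are made-up names, not existing fcoe.ko functions, and
whether the logins actually spread across CPUs still depends on where the
workqueue runs the items):

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <scsi/libfc.h>

struct fcoe_relogin {
	struct work_struct work;
	struct fc_lport *lport;
};

static void fcoe_relogin_work(struct work_struct *work)
{
	struct fcoe_relogin *rl = container_of(work, struct fcoe_relogin, work);

	/* runs in process context, off the netdev notifier path */
	fc_lport_reset(rl->lport);	/* existing libfc entry point */
	kfree(rl);
}

static void fcoe_defer_relogin(struct fc_lport *lport)
{
	struct fcoe_relogin *rl = kmalloc(sizeof(*rl), GFP_ATOMIC);

	if (!rl)
		return;
	rl->lport = lport;
	INIT_WORK(&rl->work, fcoe_relogin_work);
	schedule_work(&rl->work);
}
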
>
>>> No, not that I can think of, at least not with the non-offloaded EM. I
>>> just wanted to point out that making the common pool wouldn't
>>> necessarily _fix_ your problem, only make it less likely.
>>>
>>> I think that the OEM would not want the common pool though. There are
>>> tricks that can be done to bind irqs to CPUs such that I/O stays on the
>>> same CPU. This is done by ensuring that certain XIDs are always mapped
>>> to certain CPUs. If we create a common pool for the OEM, that mapping
>>> isn't consistent for the common XIDs.
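
For context, this is roughly how the per-cpu EM keeps that xid-to-CPU
mapping today. The CPU id lives in the low bits of the xid, so a response
steered back to the same CPU by irq affinity lands on the pool that owns
the exchange (paraphrased from fc_exch.c, not a verbatim copy):

static struct fc_exch *lookup_example(struct fc_exch_mgr *mp, u16 xid)
{
	struct fc_exch_pool *pool;
	struct fc_exch *ep = NULL;

	if (xid >= mp->min_xid && xid <= mp->max_xid) {
		/* low bits of the xid select the owning CPU's pool */
		pool = per_cpu_ptr(mp->pool, xid & fc_cpu_mask);
		spin_lock_bh(&pool->lock);
		ep = fc_exch_ptr_get(pool, (xid - mp->min_xid) >> fc_cpu_order);
		if (ep)
			fc_exch_hold(ep);
		spin_unlock_bh(&pool->lock);
	}
	return ep;
}
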
>>
>
> I've modified your patch such that EMs can be allocated as 'default,'
> where there are only per-cpu pools, or as 'common,' where there are both
> per-cpu pools and a common pool. I'm a bit uneasy about this
> implementation because it means that there are both unused EM members
> (common_pool, pool_max_index, etc...) and unused local variables when
> dealing with EMs of the 'default' scheme. It makes the code more
> confusing.
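
For reference, the shape being described, with hypothetical names (this is
not the posted patch); the common-pool members are exactly the ones that sit
unused for the 'default' scheme:

enum fc_em_scheme {
	FC_EM_SCHEME_DEFAULT,	/* per-cpu pools only (current behaviour) */
	FC_EM_SCHEME_COMMON,	/* per-cpu pools plus a shared overflow pool */
};

struct fc_exch_mgr_example {
	enum fc_em_scheme	scheme;
	struct fc_exch_pool	*pool;		/* per-cpu pools, always present */
	struct fc_exch_pool	*common_pool;	/* only used for FC_EM_SCHEME_COMMON */
	u16			pool_max_index;	/* unused in the default scheme */
};
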
>
> Instead of having one EM that has both per-cpu pools and a common pool,
> I was considering leaving the current per-cpu EM logic the same and then
> creating a different singular pool EM. If we did this then we could just
> add the singular pool EM after the per-cpu EM in our EM list.
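
Registration of the second EM could look roughly like this, assuming anchors
are tried in list order; FC_XID_SINGLE_MIN/MAX and the single-pool behaviour
are hypothetical, while fc_exch_mgr_alloc() and the FCOE_*_XID range macros
are the existing interfaces:

static int example_setup_ems(struct fc_lport *lport)
{
	/* per-cpu EM first, so it is tried first on allocation */
	if (!fc_exch_mgr_alloc(lport, FC_CLASS_3,
			       FCOE_MIN_XID, FCOE_MAX_XID, NULL))
		return -ENOMEM;

	/* then a fallback EM backed by one shared pool, on a
	 * non-overlapping xid range */
	if (!fc_exch_mgr_alloc(lport, FC_CLASS_3,
			       FC_XID_SINGLE_MIN, FC_XID_SINGLE_MAX, NULL))
		return -ENOMEM;

	return 0;
}
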
>
> We would also need to ensure that if one EM can't allocate an exchange
> then we try on the next EM in the list. This is something that I don't
> think is happening right now. That's a bug, though, and we should fix it
> anyway, because as it is, if the OEM fails to allocate an exchange we
> won't fall back to the generic EM.
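
Roughly, the allocation path would need to keep walking the anchor list
instead of stopping at the first EM whose match routine accepts the frame;
pseudocode against the fc_exch_alloc()/ema_list structure in fc_exch.c, not
a tested patch:

static struct fc_exch *fc_exch_alloc_fallback(struct fc_lport *lport,
					      struct fc_frame *fp)
{
	struct fc_exch_mgr_anchor *ema;
	struct fc_exch *ep;

	list_for_each_entry(ema, &lport->ema_list, ema_list) {
		if (ema->match && !ema->match(fp))
			continue;
		/* on failure, fall through to the next EM in the list */
		ep = fc_exch_em_alloc(lport, ema->mp);
		if (ep)
			return ep;
	}
	return NULL;
}
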
>
> I think this additional EM versus EM + common pool talk may be taking us
> in the wrong direction though. I think it's all well and good, but if we
> get the re-logins on different CPUs we may do a better job of avoiding
> the starvation.
>
>> OK.
>>
>>>
>>> I want to bring up a slightly unrelated topic as a talking point. I
>>> discovered yesterday that NPIV is broken. This happened when we added
>>> destination MAC address checking to fcoe.ko. The problem is that we're
>>> only checking the N_Port's MAC address and not walking the list of NPIV
>>> VN_Ports to check their MACs. The result is that we fail the validation
>>> of all NPIV VN_Ports, so what you see on the wire is a successful FDISC,
>>> a PLOGI to the name server and an ACC in response. The initiator then
>>> sends an ABTS for the PLOGI to the name server because fcoe.ko prevents
>>> libfc from seeing the ACC response.
>>>
>>> Yesterday, we debated a few solutions to this problem and the winning
>>> solution seems to be to make the lport's VN_Port list read-safe so that
>>> we can walk that list to check MACs without grabbing a lock (Chris'
>>> idea). Currently, we only pass the N_Port lport up to libfc, find the
>>> exchange from its EM and then call the upper layer callback (i.e.
>>> fc_rport_plogi_resp()). It isn't until we're in the callback that we
>>> look for the lport.
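
A sketch of what that read-safe walk could look like in fcoe.ko's receive
path, if the n_port->vports list were RCU-protected as Chris suggested.
The list/field names (vports, list, data_src_addr) follow my reading of
libfc/fcoe but treat them as assumptions:

#include <linux/etherdevice.h>
#include <linux/rculist.h>

static bool example_dest_mac_ok(struct fc_lport *n_port, const u8 *dest)
{
	struct fc_lport *vn_port;
	struct fcoe_port *port = lport_priv(n_port);
	bool ok = false;

	if (!compare_ether_addr(dest, port->data_src_addr))
		return true;	/* addressed to the physical N_Port */

	rcu_read_lock();
	list_for_each_entry_rcu(vn_port, &n_port->vports, list) {
		port = lport_priv(vn_port);
		if (!compare_ether_addr(dest, port->data_src_addr)) {
			ok = true;	/* addressed to one of the VN_Ports */
			break;
		}
	}
	rcu_read_unlock();
	return ok;
}
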
>>
>> Yes, currently fc_vport_id_lookup() is not exported, and it also takes
>> the lp_mutex.
>>
>
> We may be able to do the FC-BB-5 MAC destination checking in a simpler
> way than I described above. I will post a patch as soon as I've got it
> sorted out.
>
>>>
>>> I realize that this isn't your problem, but the fix would have
>>> fcoe.ko pass the correct lport up to libfc (fc_exch_recv()). This means
>>> that we don't need to rely on the exchange lookup to find the "real"
>>> lport, which would allow us to have an EM per-lport. This allows for
>>> scaling as NPIV ports are added, but doesn't really help with CPU
>>> scaling.
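
In code, that boils down to something like the following once the receive
path has resolved which VN_Port the frame belongs to (example_resolve_lport()
is hypothetical; fc_exch_recv() and fc_frame_free() are the existing libfc
entry points):

static void example_deliver(struct fc_lport *n_port, struct fc_frame *fp,
			    const u8 *dest_mac)
{
	struct fc_lport *vn_lport;

	/* hypothetical: map destination MAC (or d_id) to the VN_Port lport */
	vn_lport = example_resolve_lport(n_port, dest_mac);
	if (vn_lport)
		fc_exch_recv(vn_lport, fp);	/* the "real" lport, so a per-lport EM works */
	else
		fc_frame_free(fp);
}
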
>>
>> Yes, CPU scaling is still an issue. Maybe the number of XIDs can be a
>> multiple of nr_cpu_ids, but on a system with a large number of CPUs we
>> can exceed the XID limit of 64k.
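
To put rough numbers on that: the XID is a 16-bit field, so the whole space
is 65536. Splitting it evenly gives 65536 / nr_cpu_ids per pool, which is
only 256 XIDs each on a 256-CPU system; going the other way, a fixed slice
of, say, 1024 XIDs per CPU already consumes the full 64k at 64 CPUs, and
anything beyond that overflows the XID field.
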
>>
>> Another problem I observed with a high number of NPIV ports is that, when
>> they all fall in the same zone, each NPIV port tries to log in to every
>> other one, establishing n*(n-1) login sessions (for example, 16 NPIV ports
>> in one zone produce 16*15 = 240 sessions), which eventually leads to the
>> hang scenario. I was wondering: should we avoid logins to NPIV ports that
>> belong to the same physical port?
>>
>
> I'd prefer that we do things like increase the xid space and try to get
> the logins to fan out on the available CPUs before restricting the NPIV
> ports from logging into each other. It may not be an issue if we resolve
> the other problems.
All this will work, but there's no harm in taking an exchange from another
CPU's pool if the local pool runs out. It just creates a little more
lock contention. I think that's simpler than queuing work to another CPU.
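Roughly what I have in mind; fc_exch_pool_alloc() here is a made-up helper
standing in for the pool-local part of fc_exch_em_alloc(), and the extra
lock contention comes from taking the remote pool->lock:

static struct fc_exch *example_em_alloc(struct fc_lport *lport,
					struct fc_exch_mgr *mp)
{
	struct fc_exch *ep;
	unsigned int cpu, start;

	start = get_cpu();
	ep = fc_exch_pool_alloc(lport, mp, per_cpu_ptr(mp->pool, start));
	put_cpu();
	if (ep)
		return ep;

	/* local pool is empty: borrow from the other CPUs' pools */
	for_each_possible_cpu(cpu) {
		if (cpu == start)
			continue;
		ep = fc_exch_pool_alloc(lport, mp, per_cpu_ptr(mp->pool, cpu));
		if (ep)
			return ep;
	}
	return NULL;
}
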
Another alternative that might be better: for lport->tt.exch_seq_send(),
have a transmit queue of frames waiting for an exchange and send them
when an exchange frees up. This would avoid the problem entirely and
we wouldn't need a common pool. Four parameters to lport->tt.exch_seq_send()
would need to be stored somewhere for each queued frame. That could be done
in the data part of the frame, since there's prepend room reserved for the
FCoE and Ethernet headers, or in skb->cb, though I don't think there's space
there unless we overload seq, fsp, crc, and ptype, which aren't used at that
point.
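As a sketch only (struct, helper, and queue names made up; the four stored
fields mirror the exch_seq_send() parameters beyond lport and fp, and the
struct would be pulled back off the headroom before the frame is actually
sent):

struct fc_deferred_send {
	void	(*resp)(struct fc_seq *, struct fc_frame *, void *);
	void	(*destructor)(struct fc_seq *, void *);
	void	*arg;
	u32	timer_msec;
};

/* stash the arguments in the skb headroom and queue the frame */
static void example_queue_send(struct fc_lport *lport, struct fc_frame *fp,
			       void (*resp)(struct fc_seq *,
					    struct fc_frame *, void *),
			       void (*destructor)(struct fc_seq *, void *),
			       void *arg, u32 timer_msec,
			       struct sk_buff_head *pending)
{
	struct sk_buff *skb = fp_skb(fp);
	struct fc_deferred_send *ds;

	ds = (struct fc_deferred_send *)skb_push(skb, sizeof(*ds));
	ds->resp = resp;
	ds->destructor = destructor;
	ds->arg = arg;
	ds->timer_msec = timer_msec;
	skb_queue_tail(pending, skb);	/* drained when an exchange frees up */
}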
Joe