** Changed in: linux (Ubuntu Xenial)
Status: In Progress => Fix Committed
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1587316
Title:
STC840.20:Alpine:alp7fp1:Ubuntu 16.04, BlueFin (SAN) EEH 6 times
during boot then disabled SRC BA188002:b0314a_1612.840
Status in linux package in Ubuntu:
Fix Released
Status in linux source package in Xenial:
Fix Committed
Status in linux source package in Yakkety:
Fix Released
Bug description:
== Comment: #0 - Application Cdeadmin <[email protected]> -
2016-03-21 15:55:09 ==
== Comment: #1 - Application Cdeadmin <[email protected]> - 2016-03-21
15:55:11 ==
==== State: Open by: mlfield on 21 March 2016 14:45:01 ====
==========================Automatic entries==========================
Contact: LittleField, Michael *CONTRACTOR*
Backup: Thirukumaran V T ([email protected]), Deepti Umarani
([email protected]), Brian M. Carpenter([email protected])
===== sys_capture v5.24 === 2016-03-21_14-25-41 ===========
|
| |
| System Hardware Information:
| NODE /Sys-0/Node-0, U78C7.001.1AQH383-P2
| FSP /Sys-0/Node-0/FSP-0, FSP-2 DD 1.0, U78C7.001.1AQH383-P1-C5
| PSI /Sys-0/Node-0/FSP-0/PSI-0
| PSI /Sys-0/Node-0/FSP-0/PSI-1
| MEMBUF /Sys-0/Node-0/Membuf-12, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C11
| MEMBUF /Sys-0/Node-0/Membuf-13, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C10
| MEMBUF /Sys-0/Node-0/Membuf-14, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C12
| MEMBUF /Sys-0/Node-0/Membuf-15, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C13
| MEMBUF /Sys-0/Node-0/Membuf-20, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C23
| MEMBUF /Sys-0/Node-0/Membuf-21, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C22
| MEMBUF /Sys-0/Node-0/Membuf-22, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C24
| MEMBUF /Sys-0/Node-0/Membuf-23, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C25
| MEMBUF /Sys-0/Node-0/Membuf-28, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C19
| MEMBUF /Sys-0/Node-0/Membuf-29, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C18
| MEMBUF /Sys-0/Node-0/Membuf-30, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C20
| MEMBUF /Sys-0/Node-0/Membuf-31, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C21
| MEMBUF /Sys-0/Node-0/Membuf-36, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C31
| MEMBUF /Sys-0/Node-0/Membuf-37, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C30
| MEMBUF /Sys-0/Node-0/Membuf-38, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C32
| MEMBUF /Sys-0/Node-0/Membuf-39, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C33
| MEMBUF /Sys-0/Node-0/Membuf-4, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C15
| MEMBUF /Sys-0/Node-0/Membuf-44, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C27
| MEMBUF /Sys-0/Node-0/Membuf-45, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C26
| MEMBUF /Sys-0/Node-0/Membuf-46, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C28
| MEMBUF /Sys-0/Node-0/Membuf-47, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C29
| MEMBUF /Sys-0/Node-0/Membuf-5, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C14
| MEMBUF /Sys-0/Node-0/Membuf-52, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C39
| MEMBUF /Sys-0/Node-0/Membuf-53, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C38
| MEMBUF /Sys-0/Node-0/Membuf-54, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C40
| MEMBUF /Sys-0/Node-0/Membuf-55, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C41
| MEMBUF /Sys-0/Node-0/Membuf-6, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C16
| MEMBUF /Sys-0/Node-0/Membuf-60, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C35
| MEMBUF /Sys-0/Node-0/Membuf-61, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C34
| MEMBUF /Sys-0/Node-0/Membuf-62, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C36
| MEMBUF /Sys-0/Node-0/Membuf-63, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C37
| MEMBUF /Sys-0/Node-0/Membuf-7, CENTAUR EC 2.0,
U78C7.001.1AQH383-P2-C17
| PROC /Sys-0/Node-0/Proc-0, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C2
| CORE /Sys-0/Node-0/Proc-0/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-0/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-0/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-0/EX-4/Core-0
| PCI /Sys-0/Node-0/Proc-0/PCI-0
| PCI /Sys-0/Node-0/Proc-0/PCI-1
| PCI /Sys-0/Node-0/Proc-0/PCI-2
| PSI /Sys-0/Node-0/Proc-0/PSI-0
| PROC /Sys-0/Node-0/Proc-1, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C2
| CORE /Sys-0/Node-0/Proc-1/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-1/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-1/EX-4/Core-0
| CORE /Sys-0/Node-0/Proc-1/EX-5/Core-0
| PCI /Sys-0/Node-0/Proc-1/PCI-0
| PCI /Sys-0/Node-0/Proc-1/PCI-1
| PCI /Sys-0/Node-0/Proc-1/PCI-2
| PROC /Sys-0/Node-0/Proc-2, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C3
| CORE /Sys-0/Node-0/Proc-2/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-2/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-2/EX-4/Core-0
| CORE /Sys-0/Node-0/Proc-2/EX-5/Core-0
| PCI /Sys-0/Node-0/Proc-2/PCI-0
| PCI /Sys-0/Node-0/Proc-2/PCI-1
| PCI /Sys-0/Node-0/Proc-2/PCI-2
| PSI /Sys-0/Node-0/Proc-2/PSI-0
| PROC /Sys-0/Node-0/Proc-3, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C3
| CORE /Sys-0/Node-0/Proc-3/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-3/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-3/EX-4/Core-0
| CORE /Sys-0/Node-0/Proc-3/EX-6/Core-0
| PCI /Sys-0/Node-0/Proc-3/PCI-0
| PCI /Sys-0/Node-0/Proc-3/PCI-1
| PCI /Sys-0/Node-0/Proc-3/PCI-2
| PROC /Sys-0/Node-0/Proc-4, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C6
| CORE /Sys-0/Node-0/Proc-4/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-4/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-4/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-4/EX-6/Core-0
| PCI /Sys-0/Node-0/Proc-4/PCI-0
| PCI /Sys-0/Node-0/Proc-4/PCI-1
| PCI /Sys-0/Node-0/Proc-4/PCI-2
| PROC /Sys-0/Node-0/Proc-5, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C6
| CORE /Sys-0/Node-0/Proc-5/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-5/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-5/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-5/EX-4/Core-0
| PCI /Sys-0/Node-0/Proc-5/PCI-0
| PCI /Sys-0/Node-0/Proc-5/PCI-1
| PCI /Sys-0/Node-0/Proc-5/PCI-2
| PROC /Sys-0/Node-0/Proc-6, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C7
| CORE /Sys-0/Node-0/Proc-6/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-6/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-6/EX-4/Core-0
| CORE /Sys-0/Node-0/Proc-6/EX-5/Core-0
| PCI /Sys-0/Node-0/Proc-6/PCI-0
| PCI /Sys-0/Node-0/Proc-6/PCI-1
| PCI /Sys-0/Node-0/Proc-6/PCI-2
| PROC /Sys-0/Node-0/Proc-7, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C7
| CORE /Sys-0/Node-0/Proc-7/EX-12/Core-0
| CORE /Sys-0/Node-0/Proc-7/EX-13/Core-0
| CORE /Sys-0/Node-0/Proc-7/EX-14/Core-0
| CORE /Sys-0/Node-0/Proc-7/EX-6/Core-0
| PCI /Sys-0/Node-0/Proc-7/PCI-0
| PCI /Sys-0/Node-0/Proc-7/PCI-1
| PCI /Sys-0/Node-0/Proc-7/PCI-2
|
| System Hardware Summary:
| Configured Proc Cores: 32
| Configured IO UNITs: 24
| Configured PCIe PHB: 24
| Installed Nodes: 1
|
| Hardware InitFile Information:
| No tool support for FIRENZE
|
| Hardware (CINI) Frequency Information:
| No tool support for FIRENZE
|
| VPD Information:
| Backplane VPD:
| None found or VPD info is not available.
| VPD LID Information:
| VPD LID File [/opt/extucode/80e00040.lid]:
| VPD Keyword: [LX], Data: [3100050100300040]
| VPD LID File [/opt/extucode/80e00041.lid]:
| VPD Keyword: [LX], Data: [3100040100300041]
| VPD LID File [/opt/extucode/80e00042.lid]:
| VPD Keyword: [LX], Data: [3100040100300042]
| VPD LID File [/opt/extucode/80e00043.lid]:
| VPD Keyword: [LX], Data: [3100040100300043]
| VPD LID File [/opt/extucode/80e00044.lid]:
| VPD Keyword: [LX], Data: [3100040100300044]
| VPD LID File [/opt/extucode/80e00047.lid]:
| VPD Keyword: [LX], Data: [3100040100300047]
| Format: 0x31 (1)
| Enclosure ID: 0x0004 (P8 HV (Tuleta))
| Server Type: 0x01 (i/pSeries)
| FRU Type: 0x00 (Backplane)
| VPD Pass: 0x30 (0)
| LID Name: 0x0047 (P8 Alpine xS4U)
| VPD LID File [/opt/extucode/80e00050.lid]:
| VPD Keyword: [LX], Data: [3100060100300050]
| VPD LID File [/opt/extucode/80e00051.lid]:
| VPD Keyword: [LX], Data: [3100060100300051]
| VPD LID File [/opt/extucode/80e00942.lid]:
| VPD Keyword: [LX], Data: [3100040100300942]
| VPD LID File [/opt/extucode/80e00944.lid]:
| VPD Keyword: [LX], Data: [3100040100300944]
| VPD LID File [/opt/extucode/80e00947.lid]:
| VPD Keyword: [LX], Data: [3100040100300947]
| Format: 0x31 (1)
| Enclosure ID: 0x0004 (P8 HV (Tuleta))
| Server Type: 0x01 (i/pSeries)
| FRU Type: 0x00 (Backplane)
| VPD Pass: 0x30 (0)
| LID Name: 0x0947 (P8 Alpine Storage/Shark)
| VPD LID File [/opt/extucode/80e00ff0.lid]:
| VPD Keyword: [LX], Data: [3100040100300FF0]
|
| WARNINGS:
| * Informational: This machine has signed firmware (ship image)
|
| ERRL: Attempting to dump error logs using errl...
| Dumping all error logs on FSP to file...
| ERRL: The FSP stopped responding... skipping
|
| FFDC:
| FNM: Attempting connection for basic health check...
| TimeSincePhypStarted=82:13:57.539
| No failed tasks found.
|
| FNM: Attempting connection for PHYP FFDC...
| FNM PHYP FFDC data stored in
/fspmount/alpine/alp7fp1/b0314a_1612.840/fsp/PHYP.FFDC.20160321142537.phyp
|
| FipS MyFFDC: Was not attempted. Reason:[Not requested]
|
| Cronus: Data collection not attempted. (Unable to use Cronus via SSH
Tunnel)
|
|----- File(s) Created During Capture ------
| SysCapture Primary LogFile:
/fspmount/alpine/alp7fp1/b0314a_1612.840/fsp/PHYP.FFDC.20160321142537
| FNM PHYP FFDC stored in:
/fspmount/alpine/alp7fp1/b0314a_1612.840/fsp/PHYP.FFDC.20160321142537.phyp
|
============== end of capture ==============
============================Manual entries===========================
Title: STC840.20:Alpine:alp7fp1:Ubuntu 16.04, BlueFin (SAN) EEH 6 times
during boot then disabled SRC BA188002:b0314a_1612.840
Problem Description :
Booting Ubuntu 16.04 with Bluefin (SAN) and several other adapters: the
Bluefin adapter hit EEH 6 times and was then disabled, with SRC BA188002
reported. None of the other adapters had any issues.
===================================END===============================
==== State: Open by: mlfield on 21 March 2016 14:47:26 ====
Attached Dmesg Log: dmesg1.txt
mlfield ([email protected]) added native attachment
/opt/IBM/WebSphere/AppServer/profiles/cqweb/temp/ausratsrv5Node01/server1/TeamEAR/cqweb.war/dmesg1.txt
on 2016-03-21 14:47:26
== Comment: #2 - Application Cdeadmin <[email protected]> -
2016-03-21 15:55:16 ==
== Comment: #12 - Mauricio Faria De Oliveira <[email protected]> -
2016-04-04 14:09:48 ==
Info from Mike on ST.
Assigned the adapter in the drawer to the LPAR; it hit the problem just like
the adapter in the CEC.
This points to a kernel/driver problem, since 14.04 didn't hit the problem.
[email protected] - Michael Littlefield/Austin/Contr/IBM: just added both
Bluefins; it happens with both, so MEX and CEC.
# Slot                   Description                                               Device(s)
U78C7.001.1AQH383-P1-C4  PCI-E capable, Rev 3, 16x lanes with 16x lanes connected  fibre-channel
                                                                                   fibre-channel
U78C7.001.1AQH383-P1-C6  PCI-E capable, Rev 3, 8x lanes with 8x lanes connected    0000:60:00.1
                                                                                   0000:60:00.0
U78CD.001.FZH0132-P1-C1  PCI-E capable, Rev 3, 16x lanes with 16x lanes connected  fibre-channel
                                                                                   fibre-channel
U78CD.001.FZH0132-P2-C1  PCI-E capable, Rev 3, 16x lanes with 16x lanes connected  0002:50:00.0
U78CD.001.FZH0132-P2-C3  PCI-E capable, Rev 3, 8x lanes with 8x lanes connected    0003:70:00.0
U78CD.001.FZH0132-P2-C6  PCI-E capable, Rev 3, 8x lanes with 8x lanes connected    0004:a0:00.5
                                                                                   0004:a0:00.4
                                                                                   0004:a0:00.3
                                                                                   0004:a0:00.2
                                                                                   0004:a0:00.1
                                                                                   0004:a0:00.0
== Comment: #16 - Mauricio Faria De Oliveira <[email protected]> -
2016-04-12 18:00:26 ==
Mike provided the LPAR for debugging earlier today.
Observations.
1) The NUMA nodes configuration is weird -- likely an effect of DLPAR of
Memory/CPU.
- node 0: has CPUs but has no memory
- node 2: has CPUs and memory
- node 6: has no CPUs but has memory
(0) root @ alp7p04: /root
# numactl -H
available: 3 nodes (0,2,6)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 size: 34216 MB
node 2 free: 33248 MB
node 6 cpus:
node 6 size: 6644 MB
node 6 free: 6568 MB
node distances:
node 0 2 6
0: 10 40 40
2: 40 10 40
6: 40 40 10
2) The problem does not reproduce with 14.04 kernel (4.2 from wily).
Comparing the dmesg logs up to the point of failure, there are differences in
the NUMA setup code.
2a) A small offset difference in the NUMA DATA starting address. For example:
16.04: [ 0.000000] numa: NODE_DATA [mem 0x9ffe46100-0x9ffe4ffff]
14.04: [ 0.000000] numa: NODE_DATA [mem 0x9ffe45000-0x9ffe4ffff]
2b) A *totally* different end address in the "Initmem setup node 0"
16.04: [ 0.000000] Initmem setup node 0 [mem
0x0000000000000000-0x0000000000000000]
14.04: [ 0.000000] Initmem setup node 0 [mem
0x0000000000000000-0xffffffffffffffff]
In progress.
I'll go through the NUMA setup code.
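The node layout from observation 1 can be checked mechanically. The following is only a minimal sketch, assuming the `numactl -H` text has already been captured into a string; the function names and the shortened sample (fewer CPUs per node than on the real LPAR) are illustrative and not part of any tool used in this bug.

```python
import re


def parse_numactl(text):
    """Return {node_id: {"cpus": [...], "size_mb": int}} from `numactl -H` text."""
    nodes = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"node (\d+) cpus:(.*)", line)
        if m:
            cpus = [int(c) for c in m.group(2).split()]
            nodes.setdefault(int(m.group(1)), {})["cpus"] = cpus
            continue
        m = re.match(r"node (\d+) size: (\d+) MB", line)
        if m:
            nodes.setdefault(int(m.group(1)), {})["size_mb"] = int(m.group(2))
    return nodes


def unbalanced_nodes(nodes):
    """Return (nodes with CPUs but no memory, nodes with memory but no CPUs)."""
    memoryless = sorted(
        n for n, d in nodes.items() if d.get("cpus") and d.get("size_mb", 0) == 0
    )
    cpuless = sorted(
        n for n, d in nodes.items() if not d.get("cpus") and d.get("size_mb", 0) > 0
    )
    return memoryless, cpuless


# Shortened sample modelled on the output above (CPU lists truncated).
SAMPLE = """\
available: 3 nodes (0,2,6)
node 0 cpus: 0 1 2 3
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 16 17
node 2 size: 34216 MB
node 6 cpus:
node 6 size: 6644 MB
"""

memoryless, cpuless = unbalanced_nodes(parse_numactl(SAMPLE))
print("CPUs but no memory:", memoryless)
print("memory but no CPUs:", cpuless)
```

Note that real `numactl -H` output keeps each CPU list on one line; the wrapped list above is an artifact of the mail formatting.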
== Comment: #20 - Mauricio Faria De Oliveira <[email protected]> -
2016-04-12 18:18:52 ==
Booting the 16.04 kernel with the numa=off boot option.
The EEH errors still happen, but at a much later time (e.g., the 6th
error/permanent failure happens only after the login prompt).
== Comment: #22 - Mauricio Faria De Oliveira <[email protected]> -
2016-04-13 10:23:33 ==
(In reply to comment #16)
> 2b) A *totally* different end address in the "Initmem setup node 0"
>
> 16.04: [ 0.000000] Initmem setup node 0 [mem
> 0x0000000000000000-0x0000000000000000]
>
> 14.04: [ 0.000000] Initmem setup node 0 [mem
> 0x0000000000000000-0xffffffffffffffff]
And this is the value on the original/reported dmesg attachment (on
different NUMA node configuration, before some memory and CPUs were
moved from this LPAR to another one):
[Mon Mar 21 09:07:45 2016] Initmem setup node 0 [mem
0x0000000000000000-0x00000078cfffffff]
Notice it's non-zero, as on 14.04, so I'm not sure the NUMA
differences are directly related to this bug.
== Comment: #27 - Mauricio Faria De Oliveira <[email protected]> -
2016-05-18 19:47:05 ==
Assigning this bug to Guilherme, given his EEH debugging experience and
contacts.
From what we've discussed, this problem doesn't seem to be specific to the
lpfc device driver.
This same adapter/driver works fine on other systems (it has passed our FVT
Regression testing without this problem).
So we suspect that some change in either the EEH or the
machine/platform-dependent code is causing this, given that the 14.04 HWE
kernel doesn't show this issue on this same LPAR.
== Comment: #30 - Guilherme Guaglianoni Piccoli <[email protected]> -
2016-05-25 16:35:50 ==
Quick update on this one: I've been investigating since Monday, and what I
found is that in those cases of spontaneous EEH, the PCI BARs of the device
are filled with 0xFF, indicating some kind of corruption in the adapter's
memory.
To dump the PCI BARs I first booted without EEH (by using eeh=off).
The problem reproduces on upstream kernel v4.5, but not on v4.4, so
it seems to be a regression.
I'm studying the commits between those revisions, bisecting,
etc., so we can find which commits introduced this behavior.
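For reference, the "all 0xFF" observation is easy to test for once a BAR dump is available as raw bytes. This is a minimal sketch only; how the dump is obtained is not shown, and the function name is made up for illustration.

```python
def looks_all_ones(data: bytes) -> bool:
    """True if a non-empty buffer consists solely of 0xFF bytes."""
    return len(data) > 0 and all(b == 0xFF for b in data)


# Hypothetical dumps: one healthy-looking, one reading back as all ones.
healthy = bytes([0x00, 0x10, 0xFF, 0x42])
suspect = b"\xff" * 16

print(looks_all_ones(healthy))
print(looks_all_ones(suspect))
```

An all-ones read is also the pattern that leads the kernel to check whether the PE is frozen, which is consistent with eeh_check_failure() showing up in the stack traces for this bug.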
Thanks,
Guilherme
== Comment: #31 - Guilherme Guaglianoni Piccoli <[email protected]> -
2016-05-27 18:59:09 ==
The offending commit was found after some bisecting and analysis on the
upstream kernel:
d6de08cc462 ("lpfc: Fix the FLOGI discovery logic to comply with T11
standards")
When this commit was reverted in kernel 4.6, the problem disappeared.
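The bisect idea, sketched as a plain binary search. This is an illustration rather than how `git bisect` is implemented; it assumes a linear history in which every commit after the first bad one is also bad. The commit names `c1`, `c2` and `c4` and the `is_bad` predicate (standing in for "build, boot, check for EEH") are placeholders.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Find the first bad commit.

    `commits` is ordered oldest to newest; commits[0] is known good and
    commits[-1] is known bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: newest known good, hi: oldest known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Placeholder history between the good and bad upstream tags.
history = ["v4.4", "c1", "c2", "d6de08cc462", "c4", "v4.5"]
bad = set(history[3:])

print(first_bad(history, lambda c: c in bad))
```

Each step halves the candidate range, so even several thousand commits between two releases need only a dozen or so test boots.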
I do see some FLOGI failures in dmesg, but I guess this is somewhat normal
(reference: https://access.redhat.com/solutions/400483).
The next step is to investigate what's going on with this commit; it should
have been tested before it was merged, so this could be an unexpected corner
case we're experiencing. I guess Maurício's opinion would be really useful
here, since he has a lot of expertise in Fibre Channel devices (he should be
back at the beginning of next week).
One more thought: it's important to determine the real priority of this
bug. If this is a stop-ship, or the impact on some release would be
critical, we could ask Canonical to revert it until a proper fix is
implemented. I guess Brian, Mauricio and Breno's opinions on this are valuable.
Thanks,
Guilherme
== Comment: #32 - Mauricio Faria De Oliveira <[email protected]> -
2016-05-30 10:13:57 ==
Guilherme,
Thank you very much for the precise handling of this one. Reassigning
it back to myself.
I wouldn't have imagined this was a driver-specific problem, but given your
pointer to this commit, it's indeed something in that direction -- the
dmesg log confirms there's some involvement of the FLOGI (fabric login)
steps (related to the mentioned commit).
The devices have 2 ports (eg, PCI functions 0 and 1).
- Function 0 is processed first -- probe finishes OK, and it starts FLOGI
steps.
- Function 1 starts probe during Function 0's FLOGI steps -- and Function 1
probe fails with the EEH.
So, the change in the FLOGI logic seems to be quite involved in the
problems sensed by the mailbox commands that result in the EEH.
More on this later.
[ 1.215858] lpfc 0001:01:00.0: enabling device (0144 -> 0146)
...
[ 2.143487] lpfc 0001:01:00.1: enabling device (0144 -> 0146)
...
[ 2.636592] lpfc 0001:01:00.0: 0:1303 Link Up Event x1 received Data: x1
x0 x80 x0 x0 x0 0
[ 2.638459] lpfc 0001:01:00.0: 0:(0):2858 FLOGI failure Status:x3/x103
TMO:x14 Data x1800 x0
[ 2.638464] lpfc 0001:01:00.0: 0:(0):0100 FLOGI failure Status:x3/x103
TMO:x14
[ 2.639019] EEH: Frozen PHB#1-PE#10000 detected
...
[ 2.639049] [c00000084f612ee0] [c000000000037a84]
eeh_check_failure+0x84/0xd0
[ 2.639061] [c00000084f612f20] [d000000008ed3cc4]
lpfc_sli4_wait_bmbx_ready+0x114/0x150 [lpfc]
...
[ 2.639086] [c00000084f6131c0] [d000000008ee7780]
lpfc_cq_create+0x210/0x370 [lpfc]
...
[ 2.639113] [c00000084f613550] [d000000008f23a28]
lpfc_pci_probe_one+0x1248/0x13d0 [lpfc]
[ 2.639117] [c00000084f6135f0] [c0000000005daefc]
local_pci_probe+0x6c/0x140
...
[ 2.639158] lpfc 0001:01:00.1: 1:(0):2544 Mailbox command x9b (x1/xc)
cannot issue Data: x200 x1
...
[ 2.639166] lpfc 0001:01:00.1: 1:2501 CQ_CREATE mailbox failed with status
x0 add_status x0, mbx status xff
...
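The ordering described above (Function 1 probing while Function 0 is in its FLOGI steps, then the freeze) can be read off the timestamps. A minimal sketch, assuming the dmesg lines have been re-joined after mail wrapping; the helper names are illustrative only.

```python
import re

# "[ 2.639019] message" -> (2.639019, "message")
LINE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")


def events(dmesg_text):
    """Return a list of (timestamp_seconds, message) tuples."""
    out = []
    for raw in dmesg_text.splitlines():
        m = LINE.match(raw.strip())
        if m:
            out.append((float(m.group(1)), m.group(2)))
    return out


def first_time(evts, needle):
    """Timestamp of the first message containing `needle`, or None."""
    for t, msg in evts:
        if needle in msg:
            return t
    return None


# Lines taken from the excerpt above, with wrapped lines joined.
SAMPLE = """\
[ 1.215858] lpfc 0001:01:00.0: enabling device (0144 -> 0146)
[ 2.143487] lpfc 0001:01:00.1: enabling device (0144 -> 0146)
[ 2.636592] lpfc 0001:01:00.0: 0:1303 Link Up Event x1 received Data: x1 x0 x80 x0 x0 x0 0
[ 2.638459] lpfc 0001:01:00.0: 0:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0
[ 2.639019] EEH: Frozen PHB#1-PE#10000 detected
[ 2.639158] lpfc 0001:01:00.1: 1:(0):2544 Mailbox command x9b (x1/xc) cannot issue Data: x200 x1
"""

evts = events(SAMPLE)
f1_probe = first_time(evts, "0001:01:00.1: enabling device")
flogi_fail = first_time(evts, "FLOGI failure")
frozen = first_time(evts, "EEH: Frozen")
print(f1_probe, flogi_fail, frozen)
```

In this excerpt the Function 1 probe starts before Function 0's FLOGI failure, and the freeze is detected less than a millisecond after that failure.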
== Comment: #33 - Guilherme Guaglianoni Piccoli <[email protected]> -
2016-05-30 12:56:21 ==
Thanks Maurício!
I noticed that, compiling the kernel both with the commit and without it (by
reverting it), the following if branch is taken in lpfc_mbox_dev_check():
if (phba->link_state == LPFC_HBA_ERROR)
So, in both cases the link_state is off, but the commit perhaps introduced
some reordering such that the driver can no longer handle this failure,
maybe because of a race condition between threads.
This conclusion came from the following snippet of the commit message:
"Required reworking the call sequence in the discovery threads."
Thanks for taking it from here.
Cheers,
Guilherme
== Comment: #34 - Breno Henrique Leitao <[email protected]> - 2016-05-30
13:25:00 ==
> we could ask Canonical to revert it until a proper fix be
> implemented. Guess Brian, Mauricio and Breno's opinion on this are valuable.
Well, it will not be simple to ask them to revert it. Although we
requested the lpfc package upgrade [via bug #132388], there was
another request to do so (LP: #1541592), so I would suggest trying to
propose a fix rather than asking to revert this commit.
Does it make sense?
== Comment: #35 - Mauricio Faria De Oliveira <[email protected]> -
2016-05-30 14:17:25 ==
It seems this commit might fix the problem. I'm working on a build with it.
ae09c765109293b600ba9169aa3d632e1ac1a843
lpfc: Fix DMA faults observed upon plugging loopback connector
Driver didn't program the REG_VFI mailbox correctly, giving the adapter
bad addresses.
== Comment: #36 - Mauricio Faria De Oliveira <[email protected]> -
2016-05-30 17:35:30 ==
Hi Canonical,
Can you please apply this fix for the lpfc driver?
This upstream commit fixes the problem:
ae09c765109293b600ba9169aa3d632e1ac1a843
lpfc: Fix DMA faults observed upon plugging loopback connector
Original kernel (4.4.0-22.40)
root@alp7p04:~# uname -a
Linux alp7p04 4.4.0-22-generic #40-Ubuntu SMP Thu May 12 22:03:35 UTC
2016 ppc64le ppc64le ppc64le GNU/Linux
root@alp7p04:~# dmesg | grep -i eeh
[ 0.051252] EEH: pSeries platform initialized
[ 0.137050] EEH: devices created
[ 0.167121] EEH: PCI Enhanced I/O Error Handling Enabled
[ 3.039195] EEH: Frozen PHB#3-PE#10000 detected
[ 3.039211] EEH: PE location: N/A, PHB location: N/A
[ 3.039234] [c00000062fa16e40] [c0000000000379b4]
eeh_dev_check_failure+0x534/0x580
[ 3.039237] [c00000062fa16ee0] [c000000000037a84]
eeh_check_failure+0x84/0xd0
[ 3.039398] EEH: Detected PCI bus error on PHB#3-PE#10000
<...>
Patched kernel (4.4.0-22.40 + patch)
root@alp7p04:~# uname -a
Linux alp7p04 4.4.0-22-generic #40+bz139414c35 SMP Mon May 30 10:54:04
CDT 2016 ppc64le ppc64le ppc64le GNU/Linux
root@alp7p04:~# dmesg | grep -i eeh
[ 0.051222] EEH: pSeries platform initialized
[ 0.137348] EEH: devices created
[ 0.167359] EEH: PCI Enhanced I/O Error Handling Enabled
root@alp7p04:~#
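The before/after comparison can also be scripted. A minimal sketch, assuming the two dmesg captures are available as text; the marker strings are taken from the logs above, and the function name is illustrative.

```python
def eeh_error_lines(dmesg_text):
    """Return dmesg lines that report an EEH freeze or PCI bus error."""
    markers = ("EEH: Frozen", "EEH: Detected PCI bus error")
    return [l for l in dmesg_text.splitlines() if any(m in l for m in markers)]


ORIGINAL = """\
[ 0.051252] EEH: pSeries platform initialized
[ 0.137050] EEH: devices created
[ 0.167121] EEH: PCI Enhanced I/O Error Handling Enabled
[ 3.039195] EEH: Frozen PHB#3-PE#10000 detected
[ 3.039211] EEH: PE location: N/A, PHB location: N/A
[ 3.039398] EEH: Detected PCI bus error on PHB#3-PE#10000
"""

PATCHED = """\
[ 0.051222] EEH: pSeries platform initialized
[ 0.137348] EEH: devices created
[ 0.167359] EEH: PCI Enhanced I/O Error Handling Enabled
"""

print(len(eeh_error_lines(ORIGINAL)), "error line(s) on the original kernel")
print(len(eeh_error_lines(PATCHED)), "error line(s) on the patched kernel")
```

The initialization messages match neither marker, so a clean boot yields an empty list.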
== Comment: #38 - Mauricio Faria De Oliveira <[email protected]> -
2016-05-30 17:42:13 ==
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1587316/+subscriptions