** Also affects: ubuntu-power-systems
Importance: Undecided
Status: New
** Changed in: ubuntu-power-systems
Status: New => Triaged
** Changed in: ubuntu-power-systems
Importance: Undecided => High
** Changed in: ubuntu-power-systems
Assignee: (unassigned) => Canonical Kernel Team (canonical-kernel-team)
** Tags added: triage-g
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1767927
Title:
ISST-LTE:pKVM:Ubuntu1804: rcu_sched self-detected stall on CPU follow
by CPU ATTEMPT TO RE-ENTER FIRMWARE!
Status in The Ubuntu-power-systems project:
Triaged
Status in linux package in Ubuntu:
New
Bug description:
== Comment: #0 - Application Cdeadmin <[email protected]> -
2018-03-20 14:10:53 ==
== Comment: #1 - Application Cdeadmin <[email protected]> - 2018-03-20
14:10:54 ==
== Comment: #2 - Application Cdeadmin <[email protected]> - 2018-03-20
14:10:56 ==
------- Comment From dougmill-ibm 2018-03-20 13:51:47 EDT -------
This problem is not tied to a Linux distro. It will be fixed in firmware, as
I understand it. Let us close any redundant issues for this same problem. Mark
them as duplicate.
== Comment: #3 - Application Cdeadmin <[email protected]> - 2018-03-20
15:50:54 ==
------- Comment From mzipse 2018-03-20 15:44:26 EDT -------
@stewart-ibm @svaidy, I need you to take a first look. The stop fixes that
Vaidy had previously highlighted in a recent note are included in the 3/15 PNOR.
== Comment: #5 - Application Cdeadmin <[email protected]> - 2018-04-04
16:10:56 ==
------- Comment From haochanh 2018-04-04 16:04:07 EDT -------
We updated to 0330, bmc=1.18, and then hit bug 1134. We are currently running
with stop5 disabled but still see the watchdog hard lockup.
After 2 hours of test run, I am seeing the "Watchdog ... Hard LOCKUP" and
"became unstuck" messages:
****************************
[Wed Apr 4 13:38:25 2018] Watchdog CPU:42 Hard LOCKUP
[Wed Apr 4 13:38:25 2018] Modules linked in: vhost_net vhost macvtap macvlan
tap xfs xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
rpcsec_gss_krb5 nfsv4 nfs fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE)
ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) esp6_offload esp6 esp4_offload
esp4 xfrm_algo mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) cxl
pnv_php mlx4_en(OE) mlx4_ib(OE) ib_core(OE) mlx4_core(OE) devlink
mlx_compat(OE) kvm_hv kvm binfmt_misc dm_service_time dm_multipath scsi_dh_rdac
scsi_dh_emc scsi_dh_alua input_leds joydev mac_hid idt_89hpesx ipmi_powernv
[Wed Apr 4 13:38:25 2018] vmx_crypto ipmi_devintf at24 ofpart
uio_pdrv_genirq cmdlinepart uio powernv_flash ipmi_msghandler mtd
crct10dif_vpmsum opal_prd ibmpowernv nfsd sch_fq_codel auth_rpcgss nfs_acl
lockd grace sunrpc knem(OE) ip_tables x_tables autofs4 btrfs xor zstd_compress
raid6_pq ses enclosure scsi_transport_sas hid_generic usbhid hid lpfc ast
i2c_algo_bit ttm drm_kms_helper nvmet_fc syscopyarea sysfillrect nvmet
sysimgblt fb_sys_fops nvme_fc nvme_fabrics crc32c_vpmsum drm i40e
scsi_transport_fc aacraid [last unloaded: mlxfw]
[Wed Apr 4 13:38:25 2018] CPU: 42 PID: 0 Comm: swapper/42 Tainted: G
OE 4.15.0-12-generic #13
[Wed Apr 4 13:38:25 2018] NIP: c0000000000a3ca4 LR: c0000000000a3ca4 CTR:
c000000000008000
[Wed Apr 4 13:38:25 2018] REGS: c000000ff596fc40 TRAP: 0100 Tainted: G
OE (4.15.0-12-generic)
[Wed Apr 4 13:38:25 2018] MSR: 9000000000001033 <SF,HV,ME,IR,DR,RI,LE> CR:
24004482 XER: 20040000
[Wed Apr 4 13:38:25 2018] CFAR: c000000ff596fda0 SOFTE: 42
GPR00: c0000000000a3ca4 c000000ff596fda0
c0000000016eb200 c000000ff596fc40
GPR04: b000000000001033 c0000000000a3690
0000000024004484 0000000ffa450000
GPR08: 0000000000000001 c000000000d10ed8
00000000000000ff 0000000000000000
GPR12: 9000000000121033 c000000007a3ce00
c000000ff596ff90 0000000000000000
GPR16: 0000000000000000 c000000000047840
c000000000047810 c0000000011b5380
GPR20: 0000000000000800 c000000001722484
000000000000002a 0000000000000000
GPR24: 00000000000000a8 0000000000000007
0000000000000000 0000000000000007
GPR28: c00000000161d270 c000000ffb666fd8
c00000000161d528 0000000000000007
[Wed Apr 4 13:38:25 2018] NIP [c0000000000a3ca4] power9_idle_type+0x24/0x40
[Wed Apr 4 13:38:25 2018] LR [c0000000000a3ca4] power9_idle_type+0x24/0x40
[Wed Apr 4 13:38:25 2018] Call Trace:
[Wed Apr 4 13:38:25 2018] [c000000ff596fda0] [c0000000000a3ca4]
power9_idle_type+0x24/0x40 (unreliable)
[Wed Apr 4 13:38:25 2018] [c000000ff596fdc0] [c000000000ad1240]
stop_loop+0x40/0x5c
[Wed Apr 4 13:38:25 2018] [c000000ff596fdf0] [c000000000acd9a4]
cpuidle_enter_state+0xa4/0x450
[Wed Apr 4 13:38:25 2018] [c000000ff596fe50] [c00000000017195c]
call_cpuidle+0x4c/0x90
[Wed Apr 4 13:38:25 2018] [c000000ff596fe70] [c000000000171d70]
do_idle+0x2b0/0x330
[Wed Apr 4 13:38:25 2018] [c000000ff596fec0] [c000000000172028]
cpu_startup_entry+0x38/0x50
[Wed Apr 4 13:38:25 2018] [c000000ff596fef0] [c000000000049c30]
start_secondary+0x4f0/0x510
[Wed Apr 4 13:38:25 2018] [c000000ff596ff90] [c00000000000aa6c]
start_secondary_prolog+0x10/0x14
[Wed Apr 4 13:38:25 2018] Instruction dump:
[Wed Apr 4 13:38:25 2018] ebe1fff8 7c0803a6 4e800020 3c4c0164 38427580
7c0802a6 60000000 7c0802a6
[Wed Apr 4 13:38:25 2018] f8010010 f821ffe1 4bfff97d 4bf732d9 <60000000>
38210020 e8010010 7c0803a6
[Wed Apr 4 13:38:25 2018] Watchdog CPU:43 Hard LOCKUP
[Wed Apr 4 13:38:25 2018] Modules linked in: vhost_net vhost macvtap macvlan
tap xfs xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
rpcsec_gss_krb5 nfsv4 nfs fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE)
ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) esp6_offload esp6 esp4_offload
esp4 xfrm_algo mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) cxl
pnv_php mlx4_en(OE) mlx4_ib(OE) ib_core(OE) mlx4_core(OE) devlink
mlx_compat(OE) kvm_hv kvm binfmt_misc dm_service_time dm_multipath scsi_dh_rdac
scsi_dh_emc scsi_dh_alua input_leds joydev mac_hid idt_89hpesx ipmi_powernv
[Wed Apr 4 13:38:25 2018] vmx_crypto ipmi_devintf at24 ofpart
uio_pdrv_genirq cmdlinepart uio powernv_flash ipmi_msghandler mtd
crct10dif_vpmsum opal_prd ibmpowernv nfsd sch_fq_codel auth_rpcgss nfs_acl
lockd grace sunrpc knem(OE) ip_tables x_tables autofs4 btrfs xor zstd_compress
raid6_pq ses enclosure scsi_transport_sas hid_generic usbhid hid lpfc ast
i2c_algo_bit ttm drm_kms_helper nvmet_fc syscopyarea sysfillrect nvmet
sysimgblt fb_sys_fops nvme_fc nvme_fabrics crc32c_vpmsum drm i40e
scsi_transport_fc aacraid [last unloaded: mlxfw]
[Wed Apr 4 13:38:25 2018] CPU: 43 PID: 0 Comm: swapper/43 Tainted: G
OE 4.15.0-12-generic #13
[Wed Apr 4 13:38:25 2018] NIP: c0000000000a3ca4 LR: c0000000000a3ca4 CTR:
c000000000008000
[Wed Apr 4 13:38:25 2018] REGS: c000000ff597fc40 TRAP: 0100 Tainted: G
OE (4.15.0-12-generic)
[Wed Apr 4 13:38:25 2018] MSR: 9000000000001033 <SF,HV,ME,IR,DR,RI,LE> CR:
24004482 XER: 00000000
[Wed Apr 4 13:38:25 2018] CFAR: c000000ff597fda0 SOFTE: 43
GPR00: c0000000000a3ca4 c000000ff597fda0
c0000000016eb200 c000000ff597fc40
GPR04: b000000000001033 c0000000000a3690
0000000024004484 ffffffffffffffbf
GPR08: 000000000000007f c000000000d10ed8
00000000000000ff ffffffffffffffdf
GPR12: 9000000000121033 c000000007a3d900
c000000ff597ff90 0000000000000000
GPR16: 0000000000000000 c000000000047840
c000000000047810 c0000000011b5380
GPR20: 0000000000000800 c000000001722484
000000000000002b 0000000000000000
GPR24: 00000000000000ac 0000000000000007
0000000000000000 0000000000000007
GPR28: c00000000161d270 c000000ffb6a6fd8
c00000000161d528 0000000000000007
[Wed Apr 4 13:38:25 2018] NIP [c0000000000a3ca4] power9_idle_type+0x24/0x40
[Wed Apr 4 13:38:25 2018] LR [c0000000000a3ca4] power9_idle_type+0x24/0x40
[Wed Apr 4 13:38:25 2018] Call Trace:
[Wed Apr 4 13:38:25 2018] [c000000ff597fda0] [c0000000000a3ca4]
power9_idle_type+0x24/0x40 (unreliable)
[Wed Apr 4 13:38:25 2018] [c000000ff597fdc0] [c000000000ad1240]
stop_loop+0x40/0x5c
[Wed Apr 4 13:38:25 2018] [c000000ff597fdf0] [c000000000acd9a4]
cpuidle_enter_state+0xa4/0x450
[Wed Apr 4 13:38:25 2018] [c000000ff597fe50] [c00000000017195c]
call_cpuidle+0x4c/0x90
[Wed Apr 4 13:38:25 2018] [c000000ff597fe70] [c000000000171d70]
do_idle+0x2b0/0x330
[Wed Apr 4 13:38:25 2018] [c000000ff597fec0] [c000000000172028]
cpu_startup_entry+0x38/0x50
[Wed Apr 4 13:38:25 2018] [c000000ff597fef0] [c000000000049c30]
start_secondary+0x4f0/0x510
[Wed Apr 4 13:38:25 2018] [c000000ff597ff90] [c00000000000aa6c]
start_secondary_prolog+0x10/0x14
[Wed Apr 4 13:38:25 2018] Instruction dump:
[Wed Apr 4 13:38:25 2018] ebe1fff8 7c0803a6 4e800020 3c4c0164 38427580
7c0802a6 60000000 7c0802a6
[Wed Apr 4 13:38:25 2018] f8010010 f821ffe1 4bfff97d 4bf732d9 <60000000>
38210020 e8010010 7c0803a6
[Wed Apr 4 13:38:27 2018] Watchdog CPU:42 became unstuck
[Wed Apr 4 13:38:27 2018] Watchdog CPU:41 became unstuck
[Wed Apr 4 13:38:27 2018] Watchdog CPU:43 became unstuck
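For anyone comparing runs, the watchdog messages in a log like the one above can be tallied per CPU with a short script. This is only a sketch: the regular expression is written against the message format shown in this report, and the function names are made up for illustration. The sample lines are copied from the log above.

```python
import re
from collections import defaultdict

# Matches the two watchdog messages that appear in the log above.
EVENT_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+Watchdog CPU:(?P<cpu>\d+)\s+"
    r"(?P<event>Hard LOCKUP|became unstuck)"
)


def summarize_watchdog(lines):
    """Return {cpu: [(timestamp, event), ...]} for watchdog messages."""
    events = defaultdict(list)
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            cpu = int(match.group("cpu"))
            events[cpu].append((match.group("ts"), match.group("event")))
    return dict(events)


def unstuck_without_lockup(events):
    """CPUs with a 'became unstuck' line but no 'Hard LOCKUP' line in the input."""
    result = []
    for cpu, cpu_events in sorted(events.items()):
        kinds = {kind for _, kind in cpu_events}
        if "became unstuck" in kinds and "Hard LOCKUP" not in kinds:
            result.append(cpu)
    return result


SAMPLE = [
    "[Wed Apr 4 13:38:25 2018] Watchdog CPU:42 Hard LOCKUP",
    "[Wed Apr 4 13:38:25 2018] Watchdog CPU:43 Hard LOCKUP",
    "[Wed Apr 4 13:38:27 2018] Watchdog CPU:42 became unstuck",
    "[Wed Apr 4 13:38:27 2018] Watchdog CPU:41 became unstuck",
    "[Wed Apr 4 13:38:27 2018] Watchdog CPU:43 became unstuck",
]

if __name__ == "__main__":
    summary = summarize_watchdog(SAMPLE)
    for cpu in sorted(summary):
        print(cpu, summary[cpu])
    print("unstuck without lockup:", unstuck_without_lockup(summary))
```

On the excerpt above this flags CPU 41, which is reported as "became unstuck" although no lockup line for it is included in the pasted log.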
== Comment: #6 - Application Cdeadmin <[email protected]> - 2018-04-04
16:50:56 ==
------- Comment From youhour 2018-04-04 16:44:55 EDT -------
pegas 1.1 seems to fix my problem above by @haochanh. Upgrade your OS and
see if that will help.
== Comment: #7 - Michael Neuling <[email protected]> - 2018-04-05
16:30:31 ==
So we've seen something similar on other bugs (like
https://github.com/open-power/boston-openpower/issues/1084#issuecomment-377122303)
It looks like we may have taken an RCU stall which causes an NMI
interrupt to be sent to the stalled CPU. This then interrupts a CPU
which is in OPAL, which the kernel doesn't do a good job of recovering
from. There are two patches that can help:
The first one removes the NMI on RCU stalls here
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=47712a921bb781caf69fca9eae43be19968816cb
The second improves the kernel handling of taking an NMI/sreset inside
OPAL http://patchwork.ozlabs.org/patch/886688/ (not upstream).
== Comment: #15 - Application Cdeadmin <[email protected]> - 2018-04-12
08:41:01 ==
------- Comment From youhour 2018-04-12 08:32:21 EDT -------
@mikey Do we have a commit for the fix yet?
== Comment: #16 - Gustavo Luiz Ferreira Walbon <[email protected]> -
2018-04-12 13:46:51 ==
All,
Three of the four patches from the original patchset were accepted
upstream. The missing one is '[RFC,4/4] powerpc/xmon: Detect if
OPAL was interrupted and mark unrecoverable'
(https://patchwork.ozlabs.org/patch/886691/)
[ATTENTION] The Ubuntu kernel freeze is coming this week.
[1/4]
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/arch?h=next&id=15b4dd7981496f51c5f9262a5e0761e48de6655f
powerpc/64s: return more carefully from sreset NMI
System Reset, being an NMI, must return more carefully than other
interrupts. It has traditionally returned via the normal return
from exception path, but that has a number of problems.
- r13 does not get restored if returning to kernel. This is for
interrupts which may cause a context switch, which sreset will
never do. Interrupting OPAL (which uses a different r13) is one
place where this causes breakage.
- It may cause several other problems returning to kernel with
preempt or TIF_EMULATE_STACK_STORE if it hits at the wrong time.
It's safer just to have a simple restore and return, like machine
check which is the other NMI.
[2/4]
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/arch?h=next&id=d40b6768e45bd9213139b2d91d30c7692b6007b1
powerpc/64s: sreset panic if there is no debugger or crash dump handlers
system_reset_exception does most of its own crash handling now,
invoking the debugger or crash dumps if they are registered. If not,
then it goes through to die() to print stack traces, and then is
supposed to panic (according to comments).
However after die() prints oopses, it does its own handling which
doesn't allow system_reset_exception to panic (e.g., it may just
kill the current process). This patch causes sreset exceptions to
return from die after it prints messages but before acting.
This also stops die from invoking the debugger on 0x100 crashes.
system_reset_exception similarly calls the debugger. It had been
thought this was harmless (because if the debugger was disabled,
neither call would fire, and if it was enabled the first call
would return). However in some cases like xmon 'X' command, the
debugger returns 0, which currently causes it to be entered
again (first in system_reset_exception, then in die), which is
confusing.
[3/4]
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/arch?h=next&id=741de617661794246f84a21a02fc5e327bffc9ad
powerpc/powernv: Handle unknown OPAL errors in opal_nvram_write()
opal_nvram_write currently just assumes success if it encounters an
error other than OPAL_BUSY or OPAL_BUSY_EVENT. Have it return -EIO
on other errors instead.
Fixes: 628daa8d5abf ("powerpc/powernv: Add RTC and NVRAM support plus
RTAS fallbacks")
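The behaviour change described in [3/4] can be modelled outside the kernel. The following is a Python sketch of the retry-then-fail logic only, not the kernel code: the function and parameter names are invented for illustration, the firmware call is passed in as a plain callable, and the numeric return codes are assumed values used just to make the example runnable.

```python
import errno

# Return codes used by this model. The numeric values are assumptions
# for illustration; they are not taken from this bug report.
OPAL_SUCCESS = 0
OPAL_BUSY = -2
OPAL_BUSY_EVENT = -12


def nvram_write(firmware_write, poll_events, buf, offset):
    """Retry while firmware reports busy; map any other failure to -EIO.

    firmware_write(buf, offset) and poll_events() are hypothetical
    stand-ins for the firmware call and the event poll.
    """
    rc = OPAL_BUSY
    while rc in (OPAL_BUSY, OPAL_BUSY_EVENT):
        rc = firmware_write(buf, offset)
        if rc == OPAL_BUSY_EVENT:
            poll_events()
    # Before the fix, any code other than the two busy codes was
    # treated as success. The fix reports it as an I/O error instead.
    if rc != OPAL_SUCCESS:
        return -errno.EIO
    return len(buf)


def scripted_firmware(return_codes):
    """Build a fake firmware_write that replays the given return codes."""
    codes = iter(return_codes)
    return lambda buf, offset: next(codes)


if __name__ == "__main__":
    ok = nvram_write(scripted_firmware([OPAL_BUSY, OPAL_SUCCESS]),
                     lambda: None, b"abcd", 0)
    bad = nvram_write(scripted_firmware([-1]), lambda: None, b"abcd", 0)
    print(ok, bad)
```

The point of the model is the final check: an unknown error code now surfaces to the caller rather than being reported as a completed write.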
== Comment: #17 - Application Cdeadmin <[email protected]> - 2018-04-12
17:30:58 ==
------- Comment From youhour 2018-04-12 17:30:25 EDT -------
Stewart mentioned that these patches need to be picked by the distros.
@bwmashak Ben do you know who from the distros needs to be informed?
== Comment: #18 - Michael Y. Lim <[email protected]> - 2018-04-13 10:12:40 ==
Gustavo, please let us know which kernel version has this patch. Thank you!
== Comment: #19 - Gustavo Luiz Ferreira Walbon <[email protected]> -
2018-04-13 14:22:06 ==
(In reply to comment #18)
> Gustavo, please let us know which kernel version has this patch. Thank
you!
Hello Michael,
There is no official distro release with those patches yet; so far they are
only upstream. I have generated a build with just the requested patch set,
based on the Ubuntu kernel v4.15.0-12.13, here:
http://pokgsa.ibm.com/gsa/pokgsa/home/g/w/gwalbon/web/public/Bug165882/v2/
== Comment: #21 - Benjamin W. Mashak <[email protected]> - 2018-04-24
15:59:09 ==
Gustavo Luiz Ferreira Walbon, what's the outlook to upstream and close this
BZ? It's currently on the must-fix list for the upcoming GA in May.
== Comment: #22 - Gustavo Luiz Ferreira Walbon <[email protected]> -
2018-04-25 08:28:01 ==
(In reply to comment #21)
> Gustavo Luiz Ferreira Walbon, what's the outlook to upstream and close this
> BZ? It's currently on the must-fix list for the upcoming GA in May.
Benjamin,
The missing patch was an RFC by Nicholas Piggin. Because the series was
an RFC, only 3 of the 4 patches were judged relevant, and those were
added to the powerpc tree.
I hope these 3 patches are enough.
== Comment: #24 - Gustavo Luiz Ferreira Walbon <[email protected]> -
2018-04-25 08:34:35 ==
Adding a patch series to fix a CPU lockup in UbuntuKVM 18.04.