> On 21 Oct 2016, at 16:49, Rob Gardner <[email protected]> wrote:
>
> On 10/21/2016 06:57 AM, Anatoly Pugachev wrote:
>> On Fri, Oct 21, 2016 at 12:12 PM, Anatoly Pugachev <[email protected]>
>> wrote:
>>> On Wed, Sep 7, 2016 at 1:01 PM, Anatoly Pugachev <[email protected]> wrote:
>>>> On Wed, Sep 7, 2016 at 12:22 PM, John Paul Adrian Glaubitz
>>>> <[email protected]> wrote:
>>>>> Hello!
>>>>>
>>>>> After kernel 4.7.2 entered Debian unstable, I decided to upgrade the
>>>>> buildds and ran into an
>>>>> apparent regression with the 4.7.x kernels on sun4u machines:
>>>> It's not only with sun4u, we're getting kernel OOPS on sun4v as well:
>>> debian packaged 4.7.6 kernel, machine is a LDOM on T5-2 server, OOPS
>>> after kernel boot within a few minutes:
>>
>> reproduced with latest git 4.9.0-rc1+ (v4.9-rc1-148-g6f33d645) kernel.
>> Machine boots ok, i can login as unprivileged user (via ssh), compile
>> and install kernel, run sudo, install packages (apt upgrade),
>> apache/mysql and other startup daemons works, but if I try to login as
>> root via ssh, it throws kernel oops / illegal instruction.
>>
>> Any idea how to debug this?
>>
>> otherhost$ ssh ttip -l root -v
>> ...
>> debug1: channel 0: new [client-session]
>> debug1: Requesting [email protected]
>> debug1: Entering interactive session.
>> Write failed: Broken pipe
>> $
>>
>> I can strace -f -p $pid_of_sshd , but not sure it would help.
>>
>> URL version => http://paste.debian.net/plain/884751
>> kernel config => http://paste.debian.net/plain/884806
>>
>> NOTICE: Entering OpenBoot.
>> NOTICE: Fetching Guest MD from HV.
>> NOTICE: Starting additional cpus.
>> NOTICE: Initializing LDC services.
>> NOTICE: Probing PCI devices.
>> NOTICE: Finished PCI probing.
>>
>> SPARC T5-2, No Keyboard
>> Copyright (c) 1998, 2016, Oracle and/or its affiliates. All rights reserved.
>> OpenBoot 4.38.5, 32.0000 GB memory available, Serial #83494642.
>> Ethernet address 0:14:4f:fa:6:f2, Host ID: 84fa06f2.
>>
>>
>>
>> Boot device: vdisk1 File and args:
>> SILO Version 1.4.14
>> boot:
>> Allocated 64 Megs of memory at 0x40000000 for kernel
>> Uncompressing image...
>> Loaded kernel version 4.9.0
>> Loading initial ramdisk (13616359 bytes at 0x74000000 phys, 0x40C00000
>> virt)...
>>
>> [ 0.000000] PROMLIB: Sun IEEE Boot Prom 'OBP 4.38.5 2016/06/22 19:36'
>> [ 0.000000] PROMLIB: Root node compatible: sun4v
>> [ 0.000000] Linux version 4.9.0-rc1+ (mator@ttip) (gcc version
>> 6.2.0 20161010 (Debian 6.2.0-6+sparc64) ) #19 SMP Fri Oct 21 14:47:01
>> MSK 2016
>> [ 0.000000] bootconsole [earlyprom0] enabled
>> [ 0.000000] ARCH: SUN4V
>> ... snip ...
>> [5446612.115339] dbus-daemon(521): Kernel illegal instruction [#3]
>> [5446612.115342] CPU: 15 PID: 521 Comm: dbus-daemon Tainted: G D
>> 4.9.0-rc1+ #19
>> [5446612.115347] task: fff800080b331bc0 task.stack: fff80007f937c000
>> [5446612.115349] TSTATE: 0000004411001606 TPC: 00000000005ccfec TNPC:
>> 00000000005ccff0 Y: 00000000 Tainted: G D
>> [5446612.115353] TPC: <__kmalloc_track_caller+0x14c/0x240>
>> [5446612.115355] g0: fff800080fb28b00 g1: 0000000000400000 g2:
>> 0000000000000000 g3: 00000000c0000000
>> [5446612.115357] g4: fff800080b331bc0 g5: fff800082c5b0000 g6:
>> fff80007f937c000 g7: 0000000000003c06
>> [5446612.115358] o0: 0000000000000000 o1: 00000000025106c0 o2:
>> 000000005a5a5a5a o3: fff800080fb28b00
>> [5446612.115360] o4: 5a5a5a5a5a5a5a5a o5: 0000000000000028 sp:
>> fff80007f937eda1 ret_pc: 00000000005ccfe4
>> [5446612.115362] RPC: <__kmalloc_track_caller+0x144/0x240>
>> [5446612.115365] l0: fff8000030402800 l1: 000007feffe44e40 l2:
>> 000007feffe452b0 l3: 0000000000000000
>> [5446612.115367] l4: 0000000000000000 l5: 0000000000000020 l6:
>> fff8000100b875c8 l7: fff800010026bf30
>> [5446612.115368] i0: 0000000000000240 i1: 00000000025106c0 i2:
>> 0000000000864e00 i3: 00000000025106c0
>> [5446612.115371] i4: 0000000000000000 i5: 00000000025106c0 i6:
>> fff80007f937ee51 i7: 0000000000864d40
>> [5446612.115376] I7: <__kmalloc_reserve.isra.5+0x20/0x80>
>> [5446612.115376] Call Trace:
>> [5446612.115378] [0000000000864d40] __kmalloc_reserve.isra.5+0x20/0x80
>> [5446612.115381] [0000000000864e00] __alloc_skb+0x60/0x180
>> [5446612.115383] [0000000000864f68] alloc_skb_with_frags+0x48/0x1c0
>> [5446612.115390] [000000000085f54c] sock_alloc_send_pskb+0x1ec/0x220
>> [5446612.115400] [00000000009367a8] unix_stream_sendmsg+0x228/0x380
>> [5446612.115404] [0000000000859ddc] sock_sendmsg+0x3c/0x80
>> [5446612.115406] [000000000085a810] ___sys_sendmsg+0x250/0x260
>> [5446612.115409] [000000000085b794] __sys_sendmsg+0x34/0x80
>> [5446612.115411] [000000000085b800] SyS_sendmsg+0x20/0x40
>> [5446612.115415] [00000000004061f4] linux_sparc_syscall+0x34/0x44
>> [5446612.115417] Caller[0000000000864d40]: __kmalloc_reserve.isra.5+0x20/0x80
>> [5446612.115419] Caller[0000000000864e00]: __alloc_skb+0x60/0x180
>> [5446612.115423] Caller[0000000000864f68]: alloc_skb_with_frags+0x48/0x1c0
>> [5446612.115425] Caller[000000000085f54c]: sock_alloc_send_pskb+0x1ec/0x220
>> [5446612.115428] Caller[00000000009367a8]: unix_stream_sendmsg+0x228/0x380
>> [5446612.115430] Caller[0000000000859ddc]: sock_sendmsg+0x3c/0x80
>> [5446612.115433] Caller[000000000085a810]: ___sys_sendmsg+0x250/0x260
>> [5446612.115435] Caller[000000000085b794]: __sys_sendmsg+0x34/0x80
>> [5446612.115437] Caller[000000000085b800]: SyS_sendmsg+0x20/0x40
>> [5446612.115439] Caller[00000000004061f4]: linux_sparc_syscall+0x34/0x44
>> [5446612.115442] Caller[fff800010081770c]: 0xfff800010081770c
>> [5446612.115444] Instruction DUMP:
>> [5446612.115445] ba100008
>> [5446612.115446] 400f1d4f
>> [5446612.115447] 01000000
>> [5446612.115447] <3ffffff2>
>> [5446612.115448] 01000000
>> [5446612.115450] 106fffbe
>> [5446612.115451] 01000000
>> [5446612.115452] c611a036
>> [5446612.115452] 05002c16
>> [5446612.115452]
>> [5446612.115778] Caller[00000000005f9ed4]: SyS_mkdir+0x14/0x40
>> [5446612.115791] Caller[00000000004061f4]: linux_sparc_syscall+0x34/0x44
>> [5446612.115802] Caller[fff80001001ef870]: 0xfff80001001ef870
>> [5446612.115818] Instruction DUMP:[5446612.115823] ba100008
>> 400f1baf [5446612.115839] 01000000
>> <3ffffff2>[5446612.115852] 01000000
>> 106fffbe [5446612.115866] 01000000
>> c611a036 [5446612.115879] 05002c16
>> [5446612.115892]
>> [5446612.115902] Fixing recursive fault but reboot is needed!
>
>
> In the instruction dump, the offending instruction is always 3ffffff2, and
> according to the opcode map, this is some kind of Fujitsu Athena instruction
> which probably ought to never be generated by gcc. Can you check to see if
> this instruction is in your vmlinux file? Do 'objdump -d vmlinux' and go to
> the addresses shown in TPC in the dump (ie, 00000000005ccfec) and see what's
> there. If you see 3ffffff2, then somehow some bogus instruction made it into
> the vmlinux executable. If you see something else, then it means that the
> instruction got changed in memory after the system was booted. That could be
> either a stray memory write or a boot time patch gone wrong. Either way, it
> may help narrow down the problem.
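
The check suggested above (look up the TPC address in the disassembly and compare the word found there with the faulting word from the oops) could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, the line layout assumed for 'objdump -d' output (address, colon, four hex bytes, mnemonic) is an assumption about how binutils prints sparc64 code, and SAMPLE is a hypothetical excerpt standing in for real output.

```python
import re

# Assumed shape of an `objdump -d` instruction line on sparc64:
#   "  5ccfec:\t01 00 00 00 \tnop "
INSN_LINE = re.compile(r"^\s*([0-9a-fA-F]+):\s+((?:[0-9a-fA-F]{2}\s+){4})")


def words_by_address(disassembly):
    """Map each instruction address to its 32-bit word, parsed from objdump text."""
    words = {}
    for line in disassembly.splitlines():
        match = INSN_LINE.match(line)
        if match:
            address = int(match.group(1), 16)
            # Join the four byte columns into one big-endian instruction word.
            words[address] = int("".join(match.group(2).split()), 16)
    return words


def compare_with_oops(disassembly, tpc, faulting_word):
    """Return (word_on_disk, matches); word_on_disk is None if tpc is not listed."""
    on_disk = words_by_address(disassembly).get(tpc)
    return on_disk, on_disk == faulting_word


# Hypothetical two-line excerpt; real input would come from `objdump -d vmlinux`.
SAMPLE = "  5ccfe8:\t01 00 00 00 \tnop \n  5ccfec:\t01 00 00 00 \tnop \n"

if __name__ == "__main__":
    on_disk, same = compare_with_oops(SAMPLE, 0x5CCFEC, 0x3FFFFFF2)
    print("on disk: %08x, matches oops: %s" % (on_disk, same))
```

If the word on disk differs from the faulting word, the instruction was changed after the image was loaded, which is the second case described above.
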
Hi Rob,

They are definitely NOPs in vmlinux being clobbered at load/runtime.
According to "gdb vmlinux", the call to _cond_resched is coming from
mm/slab.h slab_pre_alloc_hook (the call to might_sleep_if).

What's the best way to get a backtrace for writes to this address?

Regards,
James
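
For reference, the difference between the expected NOP and the word in the dump can be seen by pulling apart the top-level SPARC opcode fields. The sketch below is not a disassembler: classify() and the OP2_NAMES table are illustrative names, and the table reflects my reading of the base SPARC V9 format-2 encoding, where op2 = 7 is left unassigned. That would be consistent with 3ffffff2 only being meaningful as an implementation-specific extension.

```python
SPARC_NOP = 0x01000000  # canonical nop encoding: sethi 0, %g0

# op2 field (bits 24:22) of format-2 instructions (op == 0) in base SPARC V9.
# op2 == 7 is deliberately absent: it is unassigned in the base architecture.
OP2_NAMES = {
    0: "illtrap",
    1: "BPcc",
    2: "Bicc",
    3: "BPr",
    4: "sethi",
    5: "FBPfcc",
    6: "FBfcc",
}


def classify(word):
    """Give a coarse class for a 32-bit SPARC instruction word."""
    if word == SPARC_NOP:
        return "nop"
    op = (word >> 30) & 0x3  # bits 31:30 select the instruction format
    if op == 1:
        return "call"
    if op == 0:
        op2 = (word >> 22) & 0x7
        return OP2_NAMES.get(op2, "reserved op2=%d" % op2)
    return "format 3 (arithmetic or load/store)"


if __name__ == "__main__":
    # Words taken from the first "Instruction DUMP" in the oops.
    for word in (0xBA100008, 0x400F1D4F, 0x01000000, 0x3FFFFFF2, 0x106FFFBE):
        print("%08x  %s" % (word, classify(word)))
```

Run against the dump, this reports a call followed by a nop and then a word with op2 = 7 at the TPC, where vmlinux has a nop.
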

