On Thu, May 21, 2020 at 10:14 PM Tibor Billes <tbil...@gmx.com> wrote: > > Hi, > > On Mon, 18 May 2020, Ian Rogers wrote: > > > On Sat, May 16, 2020 at 6:36 AM Billes Tibor <tbil...@gmx.com> wrote: > > > > > > Hi, > > > > > > I've been hitting a freeze on my laptop since 5.3, but haven't got the > > > time to finish bisecting it. Now > > > I had, and here is what I found: > > > > > > - 5.2 series works correctly (tested 5.2.9 and 5.2.15) > > > - 5.3 series and newer kernels freeze. The newest I tested is 5.6.10 > > > (which also freezes). There will > > > be a complete bisect log at the end of the mail. > > > > > > There are several circumstances to reproduce the freeze. At least this > > > is what I found relevant: > > > - The freeze does not occur after a fresh boot, it needs a sleep-wakeup > > > cycle. > > > - Run `perf stat -a --topdown -- sleep 8s` every minute (It is part of > > > some metrics I collect using Zabbix) > > > - Some workload (building the kernel or gaming) increases the chance of > > > freezing, but it can occur > > > without user interaction too. > > > > > > The freeze usually comes within a few minutes after wakeup. The longest > > > was about an hour. (For > > > comparison, if I don't do a sleep-wakeup, the machine works fine for 8+ > > > hours). > > > > > > First, there is a warning in syslog, then I took pictures of the actual > > > panic. > > > > > > The warning: > > > May 16 13:28:46 serpens kernel: [33128.086217] ------------[ cut here > > > ]------------ > > > May 16 13:28:46 serpens kernel: [33128.086222] WARNING: CPU: 0 PID: 0 at > > > arch/x86/events/core.c:1506 x86_pmu_del+0x140/0x160 > > > May 16 13:28:46 serpens kernel: [33128.086223] Modules linked in: > > > nouveau mxm_wmi ttm ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc > > > iptable_filter binfmt_misc essiv authenc uvcvideo iwlmvm > > > videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videodev mac80211 > > > videobuf2_common snd_hda_codec_realtek snd_hda_codec_generic libarc4 > > > snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep intel_rapl_msr > > > iwlwifi snd_hda_core intel_rapl_common snd_pcm x86_pkg_temp_thermal > > > snd_seq mei_me cfg80211 snd_seq_device snd_timer mei intel_powerclamp > > > snd soundcore ideapad_laptop coretemp sparse_keymap ip_tables x_tables > > > dm_crypt hid_generic usbhid hid i915 intel_gtt i2c_algo_bit > > > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm > > > aesni_intel glue_helper crypto_simd r8169 ahci cryptd psmouse libahci > > > realtek drm_panel_orientation_quirks video wmi > > > May 16 13:28:46 serpens kernel: [33128.086247] CPU: 0 PID: 0 Comm: > > > swapper/0 Not tainted 5.6.10 #52 > > > May 16 13:28:46 serpens kernel: [33128.086247] Hardware name: LENOVO > > > 20378/Lenovo Y50-70, BIOS 9ECN36WW(V2.00) 01/12/2015 > > > May 16 13:28:46 serpens kernel: [33128.086249] RIP: > > > 0010:x86_pmu_del+0x140/0x160 > > > May 16 13:28:46 serpens kernel: [33128.086250] Code: 63 d8 48 c7 84 dd > > > 20 07 00 00 00 00 00 00 4c 89 e7 89 85 14 02 00 00 e8 fe 27 16 00 e9 ef > > > fe ff ff 44 8d 6b 01 e9 5d ff ff ff <0f> 0b 5b 5d 41 5c 41 5d c3 31 db > > > e9 41 ff ff ff 41 bd 01 00 00 00 > > > May 16 13:28:46 serpens kernel: [33128.086251] RSP: > > > 0018:ffffa2a380003e40 EFLAGS: 00010046 > > > May 16 13:28:46 serpens kernel: [33128.086252] RAX: 0000000000000005 > > > RBX: 0000000000000005 RCX: 0000000000000010 > > > May 16 13:28:46 serpens kernel: [33128.086253] RDX: 0000000000000005 > > > RSI: 0000000000000005 RDI: ffff907c4d4b1000 > > > May 16 13:28:46 serpens kernel: [33128.086254] RBP: ffff907d932125a0 > > > R08: 0000000000000002 R09: 0000000000029f80 > > > May 16 13:28:46 serpens kernel: [33128.086254] R10: ffffa2a380003eb0 > > > R11: 0000000000000000 R12: ffff907c4d4b1000 > > > May 16 13:28:46 serpens kernel: [33128.086255] R13: 0000000000000006 > > > R14: ffff907d9326fc0c R15: ffff907d9326fb00 > > > May 16 13:28:46 serpens kernel: [33128.086256] FS: > > > 0000000000000000(0000) GS:ffff907d93200000(0000) knlGS:0000000000000000 > > > May 16 13:28:46 serpens kernel: [33128.086256] CS: 0010 DS: 0000 ES: > > > 0000 CR0: 0000000080050033 > > > May 16 13:28:46 serpens kernel: [33128.086257] CR2: 00007f713c5b44b0 > > > CR3: 000000008300a001 CR4: 00000000001606f0 > > > May 16 13:28:46 serpens kernel: [33128.086257] Call Trace: > > > May 16 13:28:46 serpens kernel: [33128.086259] <IRQ> > > > May 16 13:28:46 serpens kernel: [33128.086263] > > > event_sched_out.isra.116+0x89/0x1f0 > > > May 16 13:28:46 serpens kernel: [33128.086264] > > > group_sched_out.part.118+0x55/0xd0 > > > May 16 13:28:46 serpens kernel: [33128.086265] ctx_sched_out+0x207/0x240 > > > May 16 13:28:46 serpens kernel: [33128.086267] > > > perf_mux_hrtimer_handler+0x267/0x310 > > > May 16 13:28:46 serpens kernel: [33128.086269] ? > > > __perf_install_in_context+0x220/0x220 > > > May 16 13:28:46 serpens kernel: [33128.086270] > > > __hrtimer_run_queues+0xfa/0x260 > > > May 16 13:28:46 serpens kernel: [33128.086272] > > > hrtimer_interrupt+0xe5/0x240 > > > May 16 13:28:46 serpens kernel: [33128.086275] ? > > > recalibrate_cpu_khz+0x10/0x10 > > > May 16 13:28:46 serpens kernel: [33128.086278] > > > smp_apic_timer_interrupt+0x62/0x120 > > > May 16 13:28:46 serpens kernel: [33128.086280] > > > apic_timer_interrupt+0xf/0x20 > > > May 16 13:28:46 serpens kernel: [33128.086281] </IRQ> > > > May 16 13:28:46 serpens kernel: [33128.086284] RIP: > > > 0010:cpuidle_enter_state+0xb6/0x3f0 > > > May 16 13:28:46 serpens kernel: [33128.086285] Code: c4 0f 1f 44 00 00 > > > 31 ff e8 d7 bf a9 ff 80 7c 24 0b 00 74 12 9c 58 f6 c4 02 0f 85 12 03 00 > > > 00 31 ff e8 0e 74 af ff fb 45 85 ed <0f> 88 ad 02 00 00 49 63 c5 4c 2b > > > 24 24 48 8d 14 40 48 8d 14 90 48 > > > May 16 13:28:46 serpens kernel: [33128.086285] RSP: > > > 0018:ffffffffb5003e58 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13 > > > May 16 13:28:46 serpens kernel: [33128.086286] RAX: ffff907d9322a700 > > > RBX: ffff907d93232c00 RCX: 000000000000001f > > > May 16 13:28:46 serpens kernel: [33128.086287] RDX: 00001e213bcf5a93 > > > RSI: 0000000033518eeb RDI: 0000000000000000 > > > May 16 13:28:46 serpens kernel: [33128.086288] RBP: ffffffffb50be6a0 > > > R08: 0000000000000002 R09: 0000000000029f80 > > > May 16 13:28:46 serpens kernel: [33128.086288] R10: ffffffffb5003e40 > > > R11: 000000000000004c R12: 00001e213bcf5a93 > > > May 16 13:28:46 serpens kernel: [33128.086289] R13: 0000000000000001 > > > R14: 0000000000000001 R15: ffffffffffffffff > > > May 16 13:28:46 serpens kernel: [33128.086291] ? > > > cpuidle_enter_state+0x99/0x3f0 > > > May 16 13:28:46 serpens kernel: [33128.086292] cpuidle_enter+0x29/0x40 > > > May 16 13:28:46 serpens kernel: [33128.086295] do_idle+0x1ff/0x270 > > > May 16 13:28:46 serpens kernel: [33128.086296] cpu_startup_entry+0x19/0x20 > > > May 16 13:28:46 serpens kernel: [33128.086299] start_kernel+0x4a6/0x4b3 > > > May 16 13:28:46 serpens kernel: [33128.086302] > > > secondary_startup_64+0xa4/0xb0 > > > May 16 13:28:46 serpens kernel: [33128.086303] ---[ end trace > > > 6d25af1bc8d0e9ca ]--- > > > > > > Here are the photos of the panic: https://postimg.cc/gallery/Ywm0Ywd > > > (Same panic, just taken > > > multiple photos to be readable). > > > > > > Here is another photo of a panic from another machine: > > > https://i.postimg.cc/d00s8Ncb/DSC-0395.jpg > > > Same setup regarding perf stat -a and the sleep-wakeup, but can't > > > remember the exact kernel > > > version I was running. > > > > > > I tried reverting fd7d55172d1e2e501e6da0a5c1de25f06612dc2e which is the > > > first bad commit > > > according to my bisect, but there were conflicts, and I'm not a kernel > > > developer, I couldn't resolve > > > them. > > > > The bisect shows bad and good alongside each other and so isn't > > categorically pointing at that change, which should be benign with > > your command lines given cgroup profiling isn't used. > > I was under the impression this is how bisect works, good and bad > revisions progress as the binary search tries to pick commits in the > middle and sometimes it is before, and sometime it is after the commit in > question. Given the above I don't understand what you mean by 'so isn't > categorically pointing at that change'. Could you please explain? Is > there a way for me to do a more precise bisect? Are there any related > DEBUG configs to enable extra asserts?
Sorry for misreading. I'm not aware of any asserts. Thanks, Ian > > There were > > related fixes, such as 7fa343b7fdc4f351de4e3f28d5c285937dd1f42f and > > 90c91dfb86d0ff545bd329d3ddd72c147e2ae198 but neither mention panics. > > The crash is happening in the PMU code rather than the generic perf > > handling, so maybe that is at issue? > > Thanks for clarifying some details! > > Tibor > > > > > Sorry to not have a fix for your problem, > > Ian > > > > > Bisect log: > > > > > > git bisect start > > > # bad: [70c707aae3e586f6ebc677ad733a9c3cea32f7ac] Linux 5.5.1 > > > git bisect bad 70c707aae3e586f6ebc677ad733a9c3cea32f7ac > > > # good: [aad39e30fb9e6e7212318a1dad898f36f1075648] Linux 5.2.9 > > > git bisect good aad39e30fb9e6e7212318a1dad898f36f1075648 > > > # good: [0ecfebd2b52404ae0c54a878c872bb93363ada36] Linux 5.2 > > > git bisect good 0ecfebd2b52404ae0c54a878c872bb93363ada36 > > > # bad: [3c30819dc68ab9498216421b6846123900c0a6d3] Merge > > > git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf > > > git bisect bad 3c30819dc68ab9498216421b6846123900c0a6d3 > > > # bad: [d9b9c893048e9d308a833619f0866f1f52778cf5] Merge tag > > > 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client > > > git bisect bad d9b9c893048e9d308a833619f0866f1f52778cf5 > > > # bad: [db0457338ece7482378d88e50ad298191c3e6947] Merge branch > > > 'for-linus' of > > > git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching > > > git bisect bad db0457338ece7482378d88e50ad298191c3e6947 > > > # bad: [608745f12462e2d8d94d5cc5dc94bf0960a881e3] Merge branch > > > 'perf-core-for-linus' of > > > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip > > > git bisect bad 608745f12462e2d8d94d5cc5dc94bf0960a881e3 > > > # good: [947fbd4ca9fb38f320b076e68cfccab977c5ea01] Merge tag > > > 'please-pull-for_5.3' of > > > git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras > > > git bisect good 947fbd4ca9fb38f320b076e68cfccab977c5ea01 > > > # good: [0dcb4efb1095d0a1f5f681c2b94e98b009cc5d77] ALSA: firewire-lib: > > > code refactoring for local variables > > > git bisect good 0dcb4efb1095d0a1f5f681c2b94e98b009cc5d77 > > > # good: [cf2d213e49fdf47e4c10dc629a3659e0026a54b8] Merge tag > > > 'pm-5.3-rc1' of > > > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm > > > git bisect good cf2d213e49fdf47e4c10dc629a3659e0026a54b8 > > > # bad: [fe96245c7f38c4ea92c1c599b43f176e27d9921e] perf hists: Add > > > block_info in hist_entry > > > git bisect bad fe96245c7f38c4ea92c1c599b43f176e27d9921e > > > # good: [87407fa58b6645cecd24102f58476b8dd7ce778d] perf config: Update > > > default value for llvm.clang-bpf-cmd-template > > > git bisect good 87407fa58b6645cecd24102f58476b8dd7ce778d > > > # skip: [78d6ccce03e86de34c7000bcada493ed0679e350] perf build: Handle > > > slang being in /usr/include and in /usr/include/slang/ > > > git bisect skip 78d6ccce03e86de34c7000bcada493ed0679e350 > > > # good: [b9271f0c65e12efd9b6319f57a378aac1be98b22] Merge tag 'v5.2-rc6' > > > into perf/core, to refresh branch > > > git bisect good b9271f0c65e12efd9b6319f57a378aac1be98b22 > > > # good: [203dffacf592317e54480704f569a09f8b7ca380] Documentation/ABI: > > > Document umwait control sysfs interfaces > > > git bisect good 203dffacf592317e54480704f569a09f8b7ca380 > > > # bad: [cf8b6970f4fc31898f3d9e25159aa57e235ca4d1] perf symbols: We need > > > util.h in symbol-elf.c for zfree() > > > git bisect bad cf8b6970f4fc31898f3d9e25159aa57e235ca4d1 > > > # bad: [9f94c7f947e919c343b30f080285af53d0fa9902] perf tools: Increase > > > MAX_NR_CPUS and MAX_CACHES > > > git bisect bad 9f94c7f947e919c343b30f080285af53d0fa9902 > > > # good: [122f1c51b11a9e572263c4965d772381fcef06c5] perf/x86/rapl: Get > > > MSR values from new probe framework > > > git bisect good 122f1c51b11a9e572263c4965d772381fcef06c5 > > > # bad: [2d7102a0453769fd37e9f4ce68652e104fbf5c84] perf tools: Fix > > > cache.h include directive > > > git bisect bad 2d7102a0453769fd37e9f4ce68652e104fbf5c84 > > > # good: [637d97b53cdded02da55d0a25cda6fd6af3bd042] perf/x86/rapl: Get > > > quirk state from new probe framework > > > git bisect good 637d97b53cdded02da55d0a25cda6fd6af3bd042 > > > # bad: [fd7d55172d1e2e501e6da0a5c1de25f06612dc2e] perf/cgroups: Don't > > > rotate events for cgroups unnecessarily > > > git bisect bad fd7d55172d1e2e501e6da0a5c1de25f06612dc2e > > > # first bad commit: [fd7d55172d1e2e501e6da0a5c1de25f06612dc2e] > > > perf/cgroups: Don't rotate events for cgroups unnecessarily > > > > > > Is there any more information that I can gather to make this issue > > > easier to fix? > > > > > > Thanks, > > > Tibor Billes > >