On Sat, Jul 27, 2019 at 09:44:50AM -0700, Guenter Roeck wrote:
> Hi,
>
> I see the following traceback (or similar tracebacks) once in a while
> during my boot tests. In this specific case it is with mainline
> (v5.3-rc1-195-g3ea54d9b0d65), but I have seen it with other branches
> as well. This isn't a new problem; I have seen it for quite some time.
> There is no specific action required to make it appear; just running
> reboot loops is sufficient. The problem doesn't happen a lot;
> non-scientifically I would say I see it maybe once every few hundred
> boots.
>
> No specific action requested or asked for; this is just informational.
>
> A complete log is at:
> https://kerneltests.org/builders/qemu-x86-master/builds/1285/steps/qemubuildcommand/logs/stdio
>
> Guenter
>
> ---
> [ 61.248329] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [ 61.268277] e1000e: EEE TX LPI TIMER: 00000000
> [ 61.311435] reboot: Restarting system
> [ 61.312321] reboot: machine restart
> [ 61.342193] ------------[ cut here ]------------
> [ 61.342660] sched: Unexpected reschedule of offline CPU#2!
> ILLOPC: ce241f83: 0f 0b
> [ 61.344323] WARNING: CPU: 1 PID: 15 at arch/x86/kernel/smp.c:126
> native_smp_send_reschedule+0x33/0x40
> [ 61.344836] Modules linked in:
> [ 61.345694] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 5.3.0-rc1+ #1
> [ 61.345998] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [ 61.346569] EIP: native_smp_send_reschedule+0x33/0x40
> [ 61.347099] Code: cf 73 1c 8b 15 60 54 2b cf 8b 4a 18 ba fd 00 00 00 e8 05
> 65 c7 00 c9 c3 8d b4 26 00 00 00 00 50 68 04 ca 1a cf e8 fe e3 01 00 <0f> 0b
> 58 5a c9 c3 8d b4 26 00 00 00 00 55 89 e5 56 53 83 ec 0c 65
> [ 61.347726] EAX: 0000002e EBX: 00000002 ECX: 00000000 EDX: cdd64140
> [ 61.347977] ESI: 00000002 EDI: 00000000 EBP: cdd73c88 ESP: cdd73c80
> [ 61.348234] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00000096
> [ 61.348514] CR0: 80050033 CR2: b7ee7048 CR3: 0c28f000 CR4: 000006d0
> [ 61.348866] Call Trace:
> [ 61.349392] kick_ilb+0x90/0xa0
> [ 61.349629] trigger_load_balance+0xf0/0x5c0
> [ 61.349859] ? check_preempt_wakeup+0x1b0/0x1b0
> [ 61.350057] scheduler_tick+0xa7/0xd0

kick_ilb() iterates nohz.idle_cpus_mask to find itself an idle_cpu().

nohz.idle_cpus_mask is set from nohz_balance_enter_idle() and cleared
from nohz_balance_exit_idle(). nohz_balance_enter_idle() is called from
__tick_nohz_idle_stop_tick() when entering nohz idle; this includes the
cpu_is_offline() clause of the idle loop.

However, when offline, cpu_active() should also be false, so
nohz_balance_enter_idle() should be a no-op.

Then we have nohz_balance_exit_idle() from sched_cpu_dying(), which
should explicitly clear the CPU from the mask when going offline.

So I'm not immediately seeing how we can select an offline CPU to kick.
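To make the expected invariant concrete, here is a rough userspace model
of the reasoning above. It is not kernel code: every model_* name, the
struct, and the plain bitmask are made up for illustration, and the target
selection is simplified to "first CPU in the mask other than ourselves".
It only shows why, if the two rules hold (enter is a no-op for an inactive
CPU, and the dying step clears the bit), an offline CPU should never be
picked.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4

/* Hypothetical stand-in for the relevant kernel state. */
struct model_state {
	uint32_t active_mask;    /* models cpu_active() */
	uint32_t online_mask;    /* models cpu_online() */
	uint32_t idle_cpus_mask; /* models nohz.idle_cpus_mask */
};

static bool model_cpu_active(const struct model_state *s, int cpu)
{
	return s->active_mask & (1u << cpu);
}

/* Models nohz_balance_enter_idle(): no-op when the CPU is not active. */
static void model_enter_idle(struct model_state *s, int cpu)
{
	if (!model_cpu_active(s, cpu))
		return;
	s->idle_cpus_mask |= 1u << cpu;
}

/* Models nohz_balance_exit_idle(): drop the CPU from the idle mask. */
static void model_exit_idle(struct model_state *s, int cpu)
{
	s->idle_cpus_mask &= ~(1u << cpu);
}

/*
 * Models taking a CPU down: it stops being active, the dying step
 * clears it from the idle mask, and then it goes offline.
 */
static void model_cpu_offline(struct model_state *s, int cpu)
{
	s->active_mask &= ~(1u << cpu);
	model_exit_idle(s, cpu);
	s->online_mask &= ~(1u << cpu);
}

/*
 * Models the target selection in kick_ilb(): return the first CPU in
 * the idle mask other than 'self', or -1 if there is none.
 */
static int model_find_ilb(const struct model_state *s, int self)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (cpu == self)
			continue;
		if (s->idle_cpus_mask & (1u << cpu))
			return cpu;
	}
	return -1;
}
```

In this model, a CPU that went through model_cpu_offline() and then hits
model_enter_idle() again (the offline clause of the idle loop) stays out
of the mask, so model_find_ilb() cannot return it. The splat means the
real code gets around one of these assumptions somewhere, most likely
through an ordering that a sequential model like this cannot represent.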

