Bug#700333: Stack trace
> I merged a slightly better fix, you all were on cc. It's going into 3.10 and it's tagged stable, so it will show up in stable kernels soon.

Thanks for the fix! But where did you post it - on LKML? (I didn't see it because I'm not subscribed to LKML?)

--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#700333: Stack trace
> When you do a suspend/resume cycle.

OK, yes, I've found it there.

> The bug says
>
> "The photo shows a BUG in hrtimer_interrupt() after making the hibernation image and while resuming the non-boot CPUs."
>
> so I'm guessing with Thomas' patch it suspends fine now?

Yeah, now I'm using a patched kernel and it's OK. So, does it mean the problem is fixed by this patch or it's just confirmed and should be fixed by another one?
Bug#700333: Stack trace
On Sun, Apr 28, 2013 at 05:26:07PM +0400, vita...@yourcmc.ru wrote:
>> When you do a suspend/resume cycle.
>
> OK, yes, I've found it there.
>
>> The bug says
>>
>> "The photo shows a BUG in hrtimer_interrupt() after making the hibernation image and while resuming the non-boot CPUs."
>>
>> so I'm guessing with Thomas' patch it suspends fine now?
>
> Yeah, now I'm using a patched kernel and it's OK. So, does it mean the problem is fixed by this patch or it's just confirmed and should be fixed by another one?

Well, it makes sense to me, at least: we remove the handler on suspend so that the HPET interrupt doesn't fire. If, when the box comes up again, the pending interrupt is cleared, then all is fine - we can safely register the handler again and everyone goes about their merry way.

But don't worry, if Thomas has an idea, it is almost guaranteed you'll get a proper fix soon. :-)

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
Bug#700333: Stack trace
On Sun, 28 Apr 2013, Borislav Petkov wrote:
> On Sun, Apr 28, 2013 at 05:26:07PM +0400, vita...@yourcmc.ru wrote:
>>> When you do a suspend/resume cycle.
>>
>> OK, yes, I've found it there.
>>
>>> The bug says
>>>
>>> "The photo shows a BUG in hrtimer_interrupt() after making the hibernation image and while resuming the non-boot CPUs."
>>>
>>> so I'm guessing with Thomas' patch it suspends fine now?
>>
>> Yeah, now I'm using a patched kernel and it's OK. So, does it mean the problem is fixed by this patch or it's just confirmed and should be fixed by another one?
>
> Well, it makes sense to me, at least: we remove the handler on suspend so that the HPET interrupt doesn't fire. If, when the box comes up again, the pending interrupt is cleared, then all is fine - we can safely register the handler again and everyone goes about their merry way.
>
> But don't worry, if Thomas has an idea, it is almost guaranteed you'll get a proper fix soon. :-)

I merged a slightly better fix, you all were on cc. It's going into 3.10 and it's tagged stable, so it will show up in stable kernels soon.

Thanks,

	tglx
Bug#700333: Stack trace
> Looks like we can't do anything about that in the HPET code itself.
>
> Vitaliy, could you try that patch ?

Thanks, I've tried it several days ago (and still using a patched kernel :)) - the box survives. But at which moment should I check for Spurious interrupt in dmesg?
Bug#700333: Stack trace
On Sat, Apr 27, 2013 at 07:08:42PM +0400, vita...@yourcmc.ru wrote:
>> Looks like we can't do anything about that in the HPET code itself.
>>
>> Vitaliy, could you try that patch ?
>
> Thanks, I've tried it several days ago (and still using a patched kernel :)) - the box survives. But at which moment should I check for Spurious interrupt in dmesg?

When you do a suspend/resume cycle.

The bug says

"The photo shows a BUG in hrtimer_interrupt() after making the hibernation image and while resuming the non-boot CPUs."

so I'm guessing with Thomas' patch it suspends fine now?

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
Bug#700333: Stack trace
On Mon, 22 Apr 2013, Thomas Gleixner wrote:
> With the patch below, the box should survive and we should see a "Spurious HPET timer interrupt on HPET timer..." entry in dmesg.
>
> That's a first workaround to confirm my theory. I'll look into the HPET code how we can avoid that at all.

Looks like we can't do anything about that in the HPET code itself.

Vitaliy, could you try that patch ?

Thanks,

	tglx
Bug#700333: Stack trace
On Sun, 21 Apr 2013, Borislav Petkov wrote:
> + tglx.
>
> On Sun, Apr 21, 2013 at 01:38:33AM +0400, vita...@yourcmc.ru wrote:
>>>> Stack trace picture is here: http://vmx.yourcmc.ru/var/pics/IMG_20130306_141045.jpg
>>>
>>> Vitaliy reported that his system crashes when suspending to disk. This was a regression from 3.2 to 3.7, and remains in 3.8. Some details of this system are in the bug log at http://bugs.debian.org/700333.
>>>
>>> The photo shows a BUG in hrtimer_interrupt() after making the hibernation image and while resuming the non-boot CPUs. The HPET interrupt handler was called immediately after it was registered for CPU 2 (?), before the corresponding clock_event_device was registered.
>>>
>>> Seems like an obvious race condition, but then shouldn't the HPET have been stopped while the CPU was previously offlined? And it's strange that this system apparently hits the race quite reliably.
>>
>> Anyone?

So what happens is, that the HPET seems to have an interrupt pending and this gets immediately fired, when the handler is installed. The core code does not remove the hpet->event_handler, so it calls into the hrtimer_interrupt where it hits the BUG and dies.

With the patch below, the box should survive and we should see a "Spurious HPET timer interrupt on HPET timer..." entry in dmesg.

That's a first workaround to confirm my theory. I'll look into the HPET code how we can avoid that at all.

Thanks,

	tglx

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index b1600a6..0f0ce6e 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -323,6 +323,7 @@ static void tick_shutdown(unsigned int *cpup)
 		 */
 		dev->mode = CLOCK_EVT_MODE_UNUSED;
 		clockevents_exchange_device(dev, NULL);
+		dev->event_handler = NULL;
 		td->evtdev = NULL;
 	}
 	raw_spin_unlock_irqrestore(&tick_device_lock, flags);
Bug#700333: Stack trace
>> Stack trace picture is here: http://vmx.yourcmc.ru/var/pics/IMG_20130306_141045.jpg
>
> Vitaliy reported that his system crashes when suspending to disk. This was a regression from 3.2 to 3.7, and remains in 3.8. Some details of this system are in the bug log at http://bugs.debian.org/700333.
>
> The photo shows a BUG in hrtimer_interrupt() after making the hibernation image and while resuming the non-boot CPUs. The HPET interrupt handler was called immediately after it was registered for CPU 2 (?), before the corresponding clock_event_device was registered.
>
> Seems like an obvious race condition, but then shouldn't the HPET have been stopped while the CPU was previously offlined? And it's strange that this system apparently hits the race quite reliably.

Anyone?
Bug#700333: Stack trace
+ tglx.

On Sun, Apr 21, 2013 at 01:38:33AM +0400, vita...@yourcmc.ru wrote:
>>> Stack trace picture is here: http://vmx.yourcmc.ru/var/pics/IMG_20130306_141045.jpg
>>
>> Vitaliy reported that his system crashes when suspending to disk. This was a regression from 3.2 to 3.7, and remains in 3.8. Some details of this system are in the bug log at http://bugs.debian.org/700333.
>>
>> The photo shows a BUG in hrtimer_interrupt() after making the hibernation image and while resuming the non-boot CPUs. The HPET interrupt handler was called immediately after it was registered for CPU 2 (?), before the corresponding clock_event_device was registered.
>>
>> Seems like an obvious race condition, but then shouldn't the HPET have been stopped while the CPU was previously offlined? And it's strange that this system apparently hits the race quite reliably.
>
> Anyone?

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
Bug#700333: Stack trace
On Wed, 2013-03-06 at 15:27 +0400, vita...@yourcmc.ru wrote:
> [...]
> Stack trace picture is here: http://vmx.yourcmc.ru/var/pics/IMG_20130306_141045.jpg

Vitaliy reported that his system crashes when suspending to disk. This was a regression from 3.2 to 3.7, and remains in 3.8. Some details of this system are in the bug log at http://bugs.debian.org/700333.

The photo shows a BUG in hrtimer_interrupt() after making the hibernation image and while resuming the non-boot CPUs. The HPET interrupt handler was called immediately after it was registered for CPU 2 (?), before the corresponding clock_event_device was registered.

Seems like an obvious race condition, but then shouldn't the HPET have been stopped while the CPU was previously offlined? And it's strange that this system apparently hits the race quite reliably.

Ben.

--
Ben Hutchings
The obvious mathematical breakthrough [to break modern encryption] would be development of an easy way to factor large prime numbers. - Bill Gates
Bug#700333: Stack trace
Hi Ben!

Did the stack help you to identify something? Enabling non-boot CPUs seems suspicious to me - does that mean instead of writing an image to disk and hibernating it's trying to resume?
Bug#700333: Stack trace
> No, but I think this kernel parameter will help:
>
>     pause_on_oops=
>         Halt all CPUs after the first oops has been printed for the specified number of seconds. This is to be used if your oopses keep scrolling off the screen.
>
> (How have I not noticed this in all the years I've been crashing kernels?!)

Thanks, it helped :)

By the way, this crash happens with init=/bin/bash

Stack trace picture is here: http://vmx.yourcmc.ru/var/pics/IMG_20130306_141045.jpg
Bug#700333: Stack trace
Hello

I've booted with no_console_suspend and got the stack trace, however it's from 3.8-aptosid kernel. The problem with 3.8 is the same as with 3.7.

Can someone please help me - what does this stack mean?

Kernel panic - not syncing: Fatal exception in interrupt
[ cut here ]
WARNING: at /tmp/buildd/linux-aptosid-3.8/debian/build/source_amd64_none/arch/kernel/smp.c:123 update_process_times+0x55/0x61()
Hardware name: Studio XPS 1645
Modules linked in: dm_mirror dm_region_hash dm_log dm_mod ext4 crc16 jbd2 mbcache sd_mod crc_t10dif thermal ahci libahci libata scsi_mod fan
Pid: 17, comm: kworker/1:0 Tainted: G D 3.8-1.slh.2-aptosid-amd64 #1
Call Trace:
 <IRQ>
 warn_slowpath_common+0x76/0x8a
 update_process_times+0x55/0x61
 tick_periodic+0x60/0x6b
 tick_handle_periodic+0x18/0x52
 smp_apic_timer_interrupt+0x6e/0x81
 apic_timer_interrupt+0x6d/0x80
 up+0xc/0x35
 panic+0x18b/0x1c7
 panic+0xfd/0x1c7
 oops_end+0x9c/0xa9
 do_invalid_op+0x87/0x91
 hrtimer_interrupt+0x24/0x1a4
 load_balance+0xc3/0x62a
 run_posix_cpu_timers+0x25/0x57a
 invalid_op+0x1e/0x30
 request_threaded_irq+0x84/0xf5
 hrtimer_get_next_event+0x92/0x92
 hrtimer_interrupt+0x24/0x1a4
 tick_notify+0x216/0x378
 hpet_interrupt_handler+0x23/0x2b
 request_threaded_irq+0x84/0xf5
 handle_irq_event_percpu+0x24/0x124
 handle_irq_event+0x37/0x57
 handle_edge_irq+0x98/0xbb
 handle_irq+0x15/0x1d
 do_IRQ+0x41/0x97
 common_interrupt+0x6d/0x6d
 request_threaded_irq+0x84/0xf5
 vsnprintf+0x187/0x439
 vsnprintf+0x70/0x439
 snprintf+0x39/0x3e
 register_handler_proc+0xd8/0x114
 __setup_irq+0x334/0x3d4
 hpet_set_periodic_freq+0x5f/0x5f
 request_threaded_irq+0xba/0xf5
 hpet_work+0xe7/0x1a6
 process_one_work+0x15d/0x252
 worker_thread+0x117/0x1b2
 rescuer_thread+0x187/0x187
 kthread+0x81/0x89
 __kthread_parkme+0x5b/0x5b
 ret_from_fork+0x7c/0xb0
 __kthread_parkme+0x5b/0x5b
---[ end trace e6f760295bda327e ]---
Bug#700333: Stack trace
On Tue, 2013-03-05 at 14:19 +0400, vita...@yourcmc.ru wrote:
> Hello
>
> I've booted with no_console_suspend and got the stack trace, however it's from 3.8-aptosid kernel. The problem with 3.8 is the same as with 3.7.
>
> Can someone please help me - what does this stack mean?

It means nothing very much. How about the stack trace *before* this line:

> Kernel panic - not syncing: Fatal exception in interrupt
[...]

--
Ben Hutchings
Always try to do things in chronological order; it's less confusing that way.
Bug#700333: Stack trace
[Please reply-to-all.]

On Wed, Mar 06, 2013 at 12:25:45AM +0400, vita...@yourcmc.ru wrote:
> Hi Ben!
>
>> It means nothing very much. How about the stack trace *before* this line:
>
> The problem is that the maximum available VESA mode is 1400x1050 on my laptop and the stack is very long, and obviously I can't scroll it after a kernel panic :-)
>
> How can I get to previous lines of it? :-)

There is netconsole: https://www.kernel.org/doc/Documentation/networking/netconsole.txt

Although that might not work while suspending. Serial console would probably work if the computer has a serial port. If neither of those works then you might be able to use a video recording and freeze-frame.

Ben.

--
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking. - Albert Camus
Bug#700333: Stack trace
>>> It means nothing very much. How about the stack trace *before* this line:
>>
>> The problem is that the maximum available VESA mode is 1400x1050 on my laptop and the stack is very long, and obviously I can't scroll it after a kernel panic :-)
>>
>> How can I get to previous lines of it? :-)
>
> There is netconsole: https://www.kernel.org/doc/Documentation/networking/netconsole.txt
>
> Although that might not work while suspending. Serial console would probably work if the computer has a serial port. If neither of those works then you might be able to use a video recording and freeze-frame.

Yeah, the netconsole doesn't work during suspend - I've just checked, the last line it prints is

    Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.

However the 1st time I tried to use netconsole the suspend surprisingly worked with 3.8 :-) the second time it returned back. So it seems the bug also isn't 100% reproducible.

The computer has no serial port. And the video is also not an option - I've tried to film it with 60fps ContourHD, it seems the stack trace is printed very fast.

It would be good to have some delay after printing each line of stack trace in the kernel - is there such an option?
Bug#700333: Stack trace
On Wed, 2013-03-06 at 01:49 +0400, vita...@yourcmc.ru wrote:
>>>> It means nothing very much. How about the stack trace *before* this line:
>>>
>>> The problem is that the maximum available VESA mode is 1400x1050 on my laptop and the stack is very long, and obviously I can't scroll it after a kernel panic :-)
>>>
>>> How can I get to previous lines of it? :-)
>>
>> There is netconsole: https://www.kernel.org/doc/Documentation/networking/netconsole.txt
>>
>> Although that might not work while suspending. Serial console would probably work if the computer has a serial port. If neither of those works then you might be able to use a video recording and freeze-frame.
>
> Yeah, the netconsole doesn't work during suspend - I've just checked, the last line it prints is
>
>     Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
>
> However the 1st time I tried to use netconsole the suspend surprisingly worked with 3.8 :-) the second time it returned back. So it seems the bug also isn't 100% reproducible.
>
> The computer has no serial port. And the video is also not an option - I've tried to film it with 60fps ContourHD, it seems the stack trace is printed very fast.
>
> It would be good to have some delay after printing each line of stack trace in the kernel - is there such an option?

No, but I think this kernel parameter will help:

    pause_on_oops=
        Halt all CPUs after the first oops has been printed for the specified number of seconds. This is to be used if your oopses keep scrolling off the screen.

(How have I not noticed this in all the years I've been crashing kernels?!)

Ben.

--
Ben Hutchings
Always try to do things in chronological order; it's less confusing that way.
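As an illustration of how the two parameters mentioned in this thread might be combined, here is a possible boot configuration fragment. It assumes a Debian-style GRUB 2 setup where kernel options live in /etc/default/grub; the 30-second pause is an arbitrary example value, not something anyone in the thread specified.

```
# /etc/default/grub (assumed location; adjust for your boot loader)
# no_console_suspend: keep the console active while suspending
# pause_on_oops=30:   halt all CPUs for 30 seconds after the first oops
GRUB_CMDLINE_LINUX_DEFAULT="no_console_suspend pause_on_oops=30"
```

On such a setup the change would normally take effect after regenerating the GRUB configuration (update-grub on Debian) and rebooting; /proc/cmdline shows which parameters the running kernel was actually booted with.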