Bug#700333: Stack trace

2013-04-30 Thread vitalif

> I merged a slightly better fix, you all were on cc. It's going into
> 3.10 and it's tagged stable, so it will show up in stable kernels
> soon.


Thanks for the fix!
But where did you post it - on LKML?
(I didn't see it because I'm not subscribed to LKML?)


--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#700333: Stack trace

2013-04-28 Thread vitalif

> When you do a suspend/resume cycle.

OK, yes, I've found it there.

> The bug says "The photo shows a BUG in hrtimer_interrupt() after making
> the hibernation image and while resuming the non-boot CPUs." so I'm
> guessing with Thomas' patch it suspends fine now?

Yeah, now I'm using a patched kernel and it's OK.

So, does it mean the problem is fixed by this patch or it's just
confirmed and should be fixed by another one?






Bug#700333: Stack trace

2013-04-28 Thread Borislav Petkov
On Sun, Apr 28, 2013 at 05:26:07PM +0400, vita...@yourcmc.ru wrote:
> When you do a suspend/resume cycle.
>
> OK, yes, I've found it there.
>
> The bug says "The photo shows a BUG in hrtimer_interrupt() after making
> the hibernation image and while resuming the non-boot CPUs." so I'm
> guessing with Thomas' patch it suspends fine now?
>
> Yeah, now I'm using a patched kernel and it's OK.
>
> So, does it mean the problem is fixed by this patch or it's just
> confirmed and should be fixed by another one?

Well, it makes sense to me, at least: we remove the handler on suspend
so that the HPET interrupt doesn't fire. If, when the box comes up
again, the pending interrupt is cleared, then all is fine - we can
safely register the handler again and everyone goes about their merry
way.

But don't worry, if Thomas has an idea, it is almost guaranteed you'll
get a proper fix soon. :-)

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--





Bug#700333: Stack trace

2013-04-28 Thread Thomas Gleixner
On Sun, 28 Apr 2013, Borislav Petkov wrote:

> On Sun, Apr 28, 2013 at 05:26:07PM +0400, vita...@yourcmc.ru wrote:
> > When you do a suspend/resume cycle.
> >
> > OK, yes, I've found it there.
> >
> > The bug says "The photo shows a BUG in hrtimer_interrupt() after making
> > the hibernation image and while resuming the non-boot CPUs." so I'm
> > guessing with Thomas' patch it suspends fine now?
> >
> > Yeah, now I'm using a patched kernel and it's OK.
> >
> > So, does it mean the problem is fixed by this patch or it's just
> > confirmed and should be fixed by another one?
>
> Well, it makes sense to me, at least: we remove the handler on suspend
> so that the HPET interrupt doesn't fire. If, when the box comes up
> again, the pending interrupt is cleared, then all is fine - we can
> safely register the handler again and everyone goes about their merry
> way.
>
> But don't worry, if Thomas has an idea, it is almost guaranteed you'll
> get a proper fix soon. :-)

I merged a slightly better fix, you all were on cc. It's going into
3.10 and it's tagged stable, so it will show up in stable kernels
soon.

Thanks,

tglx





Bug#700333: Stack trace

2013-04-27 Thread vitalif

> Looks like we can't do anything about that in the HPET code itself.
>
> Vitaliy, could you try that patch ?

Thanks, I've tried it several days ago (and still using a patched
kernel :)) - the box survives.

But at which moment should I check for "Spurious interrupt" in dmesg?





Bug#700333: Stack trace

2013-04-27 Thread Borislav Petkov
On Sat, Apr 27, 2013 at 07:08:42PM +0400, vita...@yourcmc.ru wrote:
> Looks like we can't do anything about that in the HPET code itself.
>
> Vitaliy, could you try that patch ?
>
> Thanks, I've tried it several days ago (and still using a patched
> kernel :)) - the box survives.
> But at which moment should I check for "Spurious interrupt" in dmesg?

When you do a suspend/resume cycle.

The bug says "The photo shows a BUG in hrtimer_interrupt() after making
the hibernation image and while resuming the non-boot CPUs." so I'm
guessing with Thomas' patch it suspends fine now?
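
For anyone who wants to check a saved kernel log after a suspend/resume
cycle, the following is a minimal sketch in C. The function names are
hypothetical, and the search string is only the message prefix quoted in
this thread; if a given kernel words the message differently, the string
needs adjusting.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Message prefix as quoted in this thread; the exact wording printed by
 * a given kernel is an assumption. */
static const char spurious_prefix[] = "Spurious HPET timer interrupt";

/* Return 1 if a single log line contains the message, 0 otherwise. */
int is_spurious_hpet_line(const char *line)
{
    assert(line != NULL);
    return strstr(line, spurious_prefix) != NULL;
}

/* Count matching lines in a log stream, for example the output of dmesg.
 * Lines longer than the buffer are read in pieces; kernel log lines are
 * normally far shorter than that. */
int count_spurious_hpet(FILE *log)
{
    char line[1024];
    int hits = 0;

    while (fgets(line, sizeof line, log) != NULL) {
        if (is_spurious_hpet_line(line))
            hits++;
    }
    return hits;
}
```

A small main() that calls count_spurious_hpet(stdin) can be fed the output
of dmesg through a pipe right after the machine has resumed. A non-zero
count would suggest that the pending interrupt did fire and was ignored
instead of crashing the box.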

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--





Bug#700333: Stack trace

2013-04-25 Thread Thomas Gleixner
On Mon, 22 Apr 2013, Thomas Gleixner wrote:
> With the patch below, the box should survive and we should see a
> "Spurious HPET timer interrupt on HPET timer..." entry in dmesg.
>
> That's a first workaround to confirm my theory. I'll look into the
> HPET code how we can avoid that at all.

Looks like we can't do anything about that in the HPET code itself.

Vitaliy, could you try that patch ?

Thanks,

tglx





Bug#700333: Stack trace

2013-04-22 Thread Thomas Gleixner
On Sun, 21 Apr 2013, Borislav Petkov wrote:

> + tglx.
>
> On Sun, Apr 21, 2013 at 01:38:33AM +0400, vita...@yourcmc.ru wrote:
> > Stack trace picture is here:
> > http://vmx.yourcmc.ru/var/pics/IMG_20130306_141045.jpg
> >
> > Vitaliy reported that his system crashes when suspending to disk.  This
> > was a regression from 3.2 to 3.7, and remains in 3.8.  Some details of
> > this system are in the bug log at http://bugs.debian.org/700333.
> >
> > The photo shows a BUG in hrtimer_interrupt() after making the
> > hibernation image and while resuming the non-boot CPUs.  The HPET
> > interrupt handler was called immediately after it was registered for CPU
> > 2 (?), before the corresponding clock_event_device was registered.
> >
> > Seems like an obvious race condition, but then shouldn't the HPET have
> > been stopped while the CPU was previously offlined?  And it's strange
> > that this system apparently hits the race quite reliably.
> >
> > Anyone?

So what happens is, that the HPET seems to have an interrupt pending
and this gets immediately fired, when the handler is installed. The
core code does not remove the hpet->event_handler, so it calls into
the hrtimer_interrupt where it hits the BUG and dies.

With the patch below, the box should survive and we should see a
"Spurious HPET timer interrupt on HPET timer..." entry in dmesg.

That's a first workaround to confirm my theory. I'll look into the
HPET code how we can avoid that at all.

Thanks,

tglx

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index b1600a6..0f0ce6e 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -323,6 +323,7 @@ static void tick_shutdown(unsigned int *cpup)
 		 */
 		dev->mode = CLOCK_EVT_MODE_UNUSED;
 		clockevents_exchange_device(dev, NULL);
+		dev->event_handler = NULL;
 		td->evtdev = NULL;
 	}
 	raw_spin_unlock_irqrestore(&tick_device_lock, flags);
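
To make the failure mode described in this message concrete, here is a
minimal userspace sketch in C. It is not kernel code. Every model_* name
is hypothetical, and the single "ready" flag is an assumed stand-in for
whatever state hrtimer_interrupt() checks before it hits the BUG. The
NULL check in the interrupt handler is likewise an assumption, inferred
from the "Spurious HPET timer interrupt" message the patch is expected
to produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum model_outcome { MODEL_OK, MODEL_SPURIOUS, MODEL_BUG };

struct model_clock_event_device;
typedef enum model_outcome (*model_handler_t)(struct model_clock_event_device *);

/* Simplified stand-in for the kernel's struct clock_event_device. */
struct model_clock_event_device {
    model_handler_t event_handler; /* NULL means no handler is installed */
    int ready;                     /* assumed state the handler relies on */
    int num;                       /* timer number, used in the message */
};

/* Stand-in for hrtimer_interrupt(): running it while the device is not
 * ready is what the real kernel punishes with a BUG. */
static enum model_outcome model_hrtimer_interrupt(struct model_clock_event_device *dev)
{
    return dev->ready ? MODEL_OK : MODEL_BUG;
}

/* Stand-in for the HPET interrupt handler: an interrupt that arrives
 * with no event handler set is reported and otherwise ignored. */
static enum model_outcome model_hpet_interrupt_handler(struct model_clock_event_device *dev)
{
    assert(dev != NULL);
    if (dev->event_handler == NULL) {
        printf("Spurious HPET timer interrupt on HPET timer %d\n", dev->num);
        return MODEL_SPURIOUS;
    }
    return dev->event_handler(dev);
}

/* Stand-in for tick_shutdown(); 'patched' selects the behaviour with
 * the one-line change from the diff above. */
static void model_tick_shutdown(struct model_clock_event_device *dev, int patched)
{
    dev->ready = 0;
    if (patched)
        dev->event_handler = NULL;
}

/* One offline/online cycle in which an interrupt is already pending
 * when the handler is installed again. */
enum model_outcome model_offline_online_cycle(int patched)
{
    struct model_clock_event_device dev = { model_hrtimer_interrupt, 1, 2 };

    model_tick_shutdown(&dev, patched);
    /* The pending interrupt fires as soon as the handler is installed. */
    return model_hpet_interrupt_handler(&dev);
}
```

In this model, model_offline_online_cycle(0) follows the stale handler
pointer and ends in MODEL_BUG, while model_offline_online_cycle(1) takes
the NULL branch and only logs the spurious interrupt.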





Bug#700333: Stack trace

2013-04-20 Thread vitalif

> Stack trace picture is here:
> http://vmx.yourcmc.ru/var/pics/IMG_20130306_141045.jpg
>
> Vitaliy reported that his system crashes when suspending to disk.  This
> was a regression from 3.2 to 3.7, and remains in 3.8.  Some details of
> this system are in the bug log at http://bugs.debian.org/700333.
>
> The photo shows a BUG in hrtimer_interrupt() after making the
> hibernation image and while resuming the non-boot CPUs.  The HPET
> interrupt handler was called immediately after it was registered for CPU
> 2 (?), before the corresponding clock_event_device was registered.
>
> Seems like an obvious race condition, but then shouldn't the HPET have
> been stopped while the CPU was previously offlined?  And it's strange
> that this system apparently hits the race quite reliably.

Anyone?





Bug#700333: Stack trace

2013-04-20 Thread Borislav Petkov
+ tglx.

On Sun, Apr 21, 2013 at 01:38:33AM +0400, vita...@yourcmc.ru wrote:
> Stack trace picture is here:
> http://vmx.yourcmc.ru/var/pics/IMG_20130306_141045.jpg
>
> Vitaliy reported that his system crashes when suspending to disk.  This
> was a regression from 3.2 to 3.7, and remains in 3.8.  Some details of
> this system are in the bug log at http://bugs.debian.org/700333.
>
> The photo shows a BUG in hrtimer_interrupt() after making the
> hibernation image and while resuming the non-boot CPUs.  The HPET
> interrupt handler was called immediately after it was registered for CPU
> 2 (?), before the corresponding clock_event_device was registered.
>
> Seems like an obvious race condition, but then shouldn't the HPET have
> been stopped while the CPU was previously offlined?  And it's strange
> that this system apparently hits the race quite reliably.
>
> Anyone?

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--





Bug#700333: Stack trace

2013-03-10 Thread Ben Hutchings
On Wed, 2013-03-06 at 15:27 +0400, vita...@yourcmc.ru wrote:
[...]
> Stack trace picture is here:
> http://vmx.yourcmc.ru/var/pics/IMG_20130306_141045.jpg

Vitaliy reported that his system crashes when suspending to disk.  This
was a regression from 3.2 to 3.7, and remains in 3.8.  Some details of
this system are in the bug log at http://bugs.debian.org/700333.

The photo shows a BUG in hrtimer_interrupt() after making the
hibernation image and while resuming the non-boot CPUs.  The HPET
interrupt handler was called immediately after it was registered for CPU
2 (?), before the corresponding clock_event_device was registered.

Seems like an obvious race condition, but then shouldn't the HPET have
been stopped while the CPU was previously offlined?  And it's strange
that this system apparently hits the race quite reliably.

Ben.

-- 
Ben Hutchings
The obvious mathematical breakthrough [to break modern encryption] would be
development of an easy way to factor large prime numbers. - Bill Gates


signature.asc
Description: This is a digitally signed message part


Bug#700333: Stack trace

2013-03-07 Thread vitalif

Hi Ben!

Did the stack help you to identify something?

"Enabling non-boot CPUs" seems suspicious to me - does that mean that,
instead of writing an image to disk and hibernating, it's trying to
resume?






Bug#700333: Stack trace

2013-03-06 Thread vitalif

> No, but I think this kernel parameter will help:
>
>     pause_on_oops=
>             Halt all CPUs after the first oops has been printed for
>             the specified number of seconds.  This is to be used if
>             your oopses keep scrolling off the screen.
>
> (How have I not noticed this in all the years I've been crashing
> kernels?!)

Thanks, it helped :)
By the way, this crash happens with init=/bin/bash

Stack trace picture is here:
http://vmx.yourcmc.ru/var/pics/IMG_20130306_141045.jpg






Bug#700333: Stack trace

2013-03-05 Thread vitalif

Hello

I've booted with no_console_suspend and got the stack trace, however 
it's from 3.8-aptosid kernel. The problem with 3.8 is the same as with 
3.7.


Can someone please help me - what does this stack mean?

Kernel panic - not syncing: Fatal exception in interrupt
------------[ cut here ]------------
WARNING: at /tmp/buildd/linux-aptosid-3.8/debian/build/source_amd64_none/arch/kernel/smp.c:123 update_process_times+0x55/0x61()
Hardware name: Studio XPS 1645
Modules linked in: dm_mirror dm_region_hash dm_log dm_mod ext4 crc16 jbd2 mbcache sd_mod crc_t10dif thermal ahci libahci libata scsi_mod fan
Pid: 17, comm: kworker/1:0 Tainted: G D 3.8-1.slh.2-aptosid-amd64 #1
Call Trace:
<IRQ> warn_slowpath_common+0x76/0x8a
update_process_times+0x55/0x61
tick_periodic+0x60/0x6b
tick_handle_periodic+0x18/0x52
smp_apic_timer_interrupt+0x6e/0x81
apic_timer_interrupt+0x6d/0x80
up+0xc/0x35
panic+0x18b/0x1c7
panic+0xfd/0x1c7
oops_end+0x9c/0xa9
do_invalid_op+0x87/0x91
hrtimer_interrupt+0x24/0x1a4
load_balance+0xc3/0x62a
run_posix_cpu_timers+0x25/0x57a
invalid_op+0x1e/0x30
request_threaded_irq+0x84/0xf5
hrtimer_get_next_event+0x92/0x92
hrtimer_interrupt+0x24/0x1a4
tick_notify+0x216/0x378
hpet_interrupt_handler+0x23/0x2b
request_threaded_irq+0x84/0xf5
handle_irq_event_percpu+0x24/0x124
handle_irq_event+0x37/0x57
handle_edge_irq+0x98/0xbb
handle_irq+0x15/0x1d
do_IRQ+0x41/0x97
common_interrupt+0x6d/0x6d
request_threaded_irq+0x84/0xf5
vsnprintf+0x187/0x439
vsnprintf+0x70/0x439
snprintf+0x39/0x3e
register_handler_proc+0xd8/0x114
__setup_irq+0x334/0x3d4
hpet_set_periodic_freq+0x5f/0x5f
request_threaded_irq+0xba/0xf5
hpet_work+0xe7/0x1a6
process_one_work+0x15d/0x252
worker_thread+0x117/0x1b2
rescuer_thread+0x187/0x187
kthread+0x81/0x89
__kthread_parkme+0x5b/0x5b
ret_from_fork+0x7c/0xb0
__kthread_parkme+0x5b/0x5b
---[ end trace e6f760295bda327e ]---





Bug#700333: Stack trace

2013-03-05 Thread Ben Hutchings
On Tue, 2013-03-05 at 14:19 +0400, vita...@yourcmc.ru wrote:
> Hello
>
> I've booted with no_console_suspend and got the stack trace, however
> it's from 3.8-aptosid kernel. The problem with 3.8 is the same as with
> 3.7.
>
> Can someone please help me - what does this stack mean?

It means nothing very much.  How about the stack trace *before* this
line:

> Kernel panic - not syncing: Fatal exception in interrupt
[...]

-- 
Ben Hutchings
Always try to do things in chronological order;
it's less confusing that way.




Bug#700333: Stack trace

2013-03-05 Thread Ben Hutchings
[Please reply-to-all.]

On Wed, Mar 06, 2013 at 12:25:45AM +0400, vita...@yourcmc.ru wrote:
> Hi Ben!
>
> It means nothing very much.  How about the stack trace *before* this
> line:
>
> The problem is that the maximum available VESA mode is 1400x1050 on
> my laptop and the stack is very long, and obviously I can't scroll
> it after a kernel panic :-)
> How can I get to previous lines of it? :-)

There is netconsole:
https://www.kernel.org/doc/Documentation/networking/netconsole.txt

Although that might not work while suspending.  Serial console would
probably work if the computer has a serial port.  If neither of those
works then you might be able to use a video recording and freeze-
frame.

Ben.

-- 
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
  - Albert Camus





Bug#700333: Stack trace

2013-03-05 Thread vitalif
> It means nothing very much.  How about the stack trace *before* this
> line:
>
> The problem is that the maximum available VESA mode is 1400x1050 on
> my laptop and the stack is very long, and obviously I can't scroll
> it after a kernel panic :-)
> How can I get to previous lines of it? :-)
>
> There is netconsole:
> https://www.kernel.org/doc/Documentation/networking/netconsole.txt
>
> Although that might not work while suspending.  Serial console would
> probably work if the computer has a serial port.  If neither of those
> works then you might be able to use a video recording and
> freeze-frame.

Yeah, the netconsole doesn't work during suspend - I've just checked,
the last line it prints is "Freezing remaining freezable tasks ...
(elapsed 0.01 seconds) done."

However the 1st time I tried to use netconsole the suspend surprisingly
worked with 3.8 :-) the second time it returned back. So it seems the
bug also isn't 100% reproducible.

The computer has no serial port.

And the video is also not an option - I've tried to film it with 60fps
ContourHD, it seems the stack trace is printed very fast.

It would be good to have some delay after printing each line of stack
trace in the kernel - is there such an option?






Bug#700333: Stack trace

2013-03-05 Thread Ben Hutchings
On Wed, 2013-03-06 at 01:49 +0400, vita...@yourcmc.ru wrote:
> > It means nothing very much.  How about the stack trace *before* this
> > line:
> >
> > The problem is that the maximum available VESA mode is 1400x1050 on
> > my laptop and the stack is very long, and obviously I can't scroll
> > it after a kernel panic :-)
> > How can I get to previous lines of it? :-)
> >
> > There is netconsole:
> > https://www.kernel.org/doc/Documentation/networking/netconsole.txt
> >
> > Although that might not work while suspending.  Serial console would
> > probably work if the computer has a serial port.  If neither of those
> > works then you might be able to use a video recording and
> > freeze-frame.
>
> Yeah, the netconsole doesn't work during suspend - I've just checked,
> the last line it prints is "Freezing remaining freezable tasks ...
> (elapsed 0.01 seconds) done."
>
> However the 1st time I tried to use netconsole the suspend surprisingly
> worked with 3.8 :-) the second time it returned back. So it seems the
> bug also isn't 100% reproducible.
>
> The computer has no serial port.
>
> And the video is also not an option - I've tried to film it with 60fps
> ContourHD, it seems the stack trace is printed very fast.
>
> It would be good to have some delay after printing each line of stack
> trace in the kernel - is there such an option?

No, but I think this kernel parameter will help:

    pause_on_oops=
            Halt all CPUs after the first oops has been printed for
            the specified number of seconds.  This is to be used if
            your oopses keep scrolling off the screen.

(How have I not noticed this in all the years I've been crashing
kernels?!)

Ben.

-- 
Ben Hutchings
Always try to do things in chronological order;
it's less confusing that way.

