Bug#799853: linux-image-2.6.32-5-xen-amd64: Xen kernel BUG: unable to handle kernel paging request

2015-10-04 Thread Ian Campbell
On Wed, 2015-09-23 at 11:51 +0200, Zdeněk Bělehrádek wrote:
> Package: linux-2.6
> Version: 2.6.32-48squeeze10

I'm afraid that with Squeeze being old-old-stable at this point, your
best bet is going to be to upgrade at least the kernel, if not the whole
distro, to something newer than Squeeze.

I'd suggest starting with 2.6.32-48squeeze14 from the LTS effort. If
that doesn't help (which I think is most likely going to be the case),
then 3.2.68-1+deb7u3~bpo60+1 from the o-o-bpo might be a good bet.
That's assuming you cannot upgrade the entire system to Wheezy or even
Jessie, which would be best of course.

Ian.


> Severity: important
> 
> 
> We have several virtual servers running under Xen, and two of them
> crash every few hours to days. Crash times are quite random, we have
> seen two crashes just about 2 minutes apart, and also few days went
> without crashing.
> 
> The crashing servers are used as mailservers, and run several
> instances of Exim, each listening on different loopback and public
> IP address. Our customer uses it to send bulk e-mails, so there are
> long intervals of inactivity. We do have more of these servers, only
> two of them are crashing.
> 
> I checked the core dump with the crash utility, and it always hits
> kernel BUG: unable to handle kernel paging request, always in the
> same function and with the same backtrace. The crash is always
> triggered by collectd process. We tried to update kernel to latest
> version, and it had no effect. 
> 
> The hypervisor is xen-hypervisor-4.4-amd64 from Debian Jessie, the
> Dom0 is also Jessie. There is enough RAM in physical HW to support
> all the guests and some more.
> 
> I censored hostnames and IP addresses to protect the innocent.

Bug#799853: linux-image-2.6.32-5-xen-amd64: Xen kernel BUG: unable to handle kernel paging request

2015-09-23 Thread Zdeněk Bělehrádek
Package: linux-2.6
Version: 2.6.32-48squeeze10
Severity: important


We have several virtual servers running under Xen, and two of them crash every 
few hours to days. The crash times are quite random: we have seen two crashes 
only about 2 minutes apart, and also stretches of a few days without any crash.

The crashing servers are used as mail servers and run several instances of 
Exim, each listening on a different loopback and public IP address. Our 
customer uses them to send bulk e-mails, so there are long intervals of 
inactivity. We have more of these servers, but only two of them are crashing.

I checked the core dump with the crash utility, and it always hits kernel BUG: 
unable to handle kernel paging request, always in the same function and with 
the same backtrace. The crash is always triggered by the collectd process. We 
tried updating the kernel to the latest version, and it had no effect.
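
To check that several dumps really share the same faulting function and backtrace, the comparison can be scripted. The following is only a minimal sketch: the helper name oops_signature and the regular expression are illustrative assumptions, not part of the crash utility, and it expects dmesg text in which each log entry sits on a single line.

```python
import re

# Matches symbol references such as "inet_diag_dump+0x39f/0x78f [inet_diag]".
# The trailing "[module]" part is optional.
SYMBOL_RE = re.compile(
    r"([A-Za-z_][\w.]*)\+0x([0-9a-f]+)/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def oops_signature(dmesg_text):
    """Return (fault, trace) for the first oops found in dmesg_text.

    fault is (function, offset, module) taken from the "IP:" line, or None.
    trace is the list of function names from the first "Call Trace:" block.
    """
    fault = None
    trace = []
    in_trace = False
    for line in dmesg_text.splitlines():
        match = SYMBOL_RE.search(line)
        if fault is None and "] IP:" in line and match:
            fault = (match.group(1), int(match.group(2), 16), match.group(3))
        elif "Call Trace:" in line:
            in_trace = True
        elif in_trace:
            if match is None:
                break  # first line without a symbol ends the trace
            trace.append(match.group(1))
    return fault, trace


if __name__ == "__main__":
    sample = (
        "[12694.750086] IP: [] inet_diag_dump+0x39f/0x78f [inet_diag]\n"
        "[12694.752007] Call Trace:\n"
        "[12694.752007]  [] ? sock_rmalloc+0x29/0x86\n"
        "[12694.752007]  [] ? netlink_dump+0x54/0x16c\n"
        "[12694.752007] Code: 34 4c 89 f7\n"
    )
    print(oops_signature(sample))
```

Two crashes can then be compared with a plain equality check on the returned tuples.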

The hypervisor is xen-hypervisor-4.4-amd64 from Debian Jessie, and the Dom0 is 
also Jessie. The physical hardware has enough RAM to support all the guests, 
with some to spare.

I censored hostnames and IP addresses to protect the innocent.

-- Dmesg from crashed guest:

[12694.749508] BUG: unable to handle kernel paging request at 880002c49500
[12694.750086] IP: [] inet_diag_dump+0x39f/0x78f [inet_diag]
[12694.750690] PGD 1002067 PUD 1006067 PMD 3a9f067 PTE 0
[12694.751300] Oops:  [#1] SMP 
[12694.751932] last sysfs file: /sys/devices/vbd-2049/block/xvda1/uevent
[12694.752007] CPU 0 
[12694.752007] Modules linked in: tcp_diag inet_diag loop snd_pcm snd_timer snd soundcore snd_page_alloc evdev pcspkr joydev ext3 jbd mbcache dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod xen_blkfront xen_netfront
[12694.752007] Pid: 892, comm: collectd Not tainted 2.6.32-5-xen-amd64 #1 
[12694.752007] RIP: e030:[]  [] inet_diag_dump+0x39f/0x78f [inet_diag]
[12694.752007] RSP: e02b:8800fcc31a88  EFLAGS: 00010246
[12694.752007] RAX: 880002c49500 RBX: 8800fc558c70 RCX: 
[12694.752007] RDX: 8800fef1cdc0 RSI: 8800fc558c60 RDI: 8800fd848148
[12694.752007] RBP: 880002c3ef00 R08: 8800fc558000 R09: 
[12694.752007] R10: 7f08762daeb0 R11: 8127be52 R12: 8800fc558c60
[12694.752007] R13: 8800fdd8ea20 R14: 8800fd848000 R15: 8800fc558c60
[12694.752007] FS:  7f08762db700() GS:880003add000() knlGS:
[12694.752007] CS:  e033 DS:  ES:  CR0: 8005003b
[12694.752007] CR2: 880002c49500 CR3: fd023000 CR4: 2660
[12694.752007] DR0:  DR1:  DR2: 
[12694.752007] DR3:  DR6: 0ff0 DR7: 0400
[12694.752007] Process collectd (pid: 892, threadinfo 8800fcc3, task 8800fc902350)
[12694.752007] Stack:
[12694.752007]  81255a22 8800fc558000 0004 880002b14f00
[12694.752007] <0> 001c 816e1f80 001c00d0
[12694.752007] <0> 880002a5a810 816e1d80 00d0 0074
[12694.752007] Call Trace:
[12694.752007]  [] ? sock_rmalloc+0x29/0x86
[12694.752007]  [] ? netlink_dump+0x54/0x16c
[12694.752007]  [] ? netlink_recvmsg+0x1a6/0x2c0
[12694.752007]  [] ? hrtimer_try_to_cancel+0x3a/0x43
[12694.752007]  [] ? sock_recvmsg+0xa6/0xbe
[12694.752007]  [] ? autoremove_wake_function+0x0/0x2e
[12694.752007]  [] ? autoremove_wake_function+0x0/0x2e
[12694.752007]  [] ? verify_iovec+0x52/0xa2
[12694.752007]  [] ? verify_iovec+0x52/0xa2
[12694.752007]  [] ? sys_recvmsg+0x1b7/0x2cc
[12694.752007]  [] ? sk_prot_alloc+0x79/0x12f
[12694.752007]  [] ? sock_attach_fd+0x91/0xbf
[12694.752007]  [] ? fd_install+0x2e/0x5a
[12694.752007]  [] ? sock_map_fd+0x57/0x64
[12694.752007]  [] ? system_call_fastpath+0x16/0x1b
[12694.752007] Code: 34 4c 89 f7 c7 43 3c 00 00 00 00 e8 88 b8 0d e1 c7 43 44 00 00 00 00 89 43 40 41 80 7c 24 10 0a 75 35 0f b7 45 38 48 8d 44 05 00 <48> 8b 10 49 89 54 24 18 48 8b 40 08 49 89 44 24 20 0f b7 45 38 
[12694.752007] RIP  [] inet_diag_dump+0x39f/0x78f [inet_diag]
[12694.752007]  RSP 
[12694.752007] CR2: 880002c49500
[12694.752007] ---[ end trace 63491d1bbc9c1a62 ]---
[12694.752007] Kernel panic - not syncing: Fatal exception in interrupt
[12694.752007] Pid: 892, comm: collectd Tainted: G  D 2.6.32-5-xen-amd64 #1
[12694.752007] Call Trace:
[12694.752007]  [] ? panic+0x86/0x143
[12694.752007]  [] ? _spin_lock_irqsave+0x15/0x34
[12694.752007]  [] ? bit_cursor+0x0/0x480
[12694.752007]  [] ? up+0xe/0x37
[12694.752007]  [] ? _spin_unlock_irqrestore+0xd/0xe
[12694.752007]  [] ? release_console_sem+0x17e/0x1af
[12694.752007]  [] ? oops_end+0xa7/0xb4
[12694.752007]  [] ? no_context+0x1e9/0x1f8
[12694.752007]  [] ? __nlmsg_put+0x35/0x70 [inet_diag]
[12694.752007]  [] ? __bad_area_nosemaphore+0x1a6/0x1ca
[12694.752007]  [] ? inet_csk_diag_fill+0x30d/0x388
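
The "PGD ... PUD ... PMD ... PTE 0" line in the oops above can be read with a small helper. This is a hedged sketch, not output from any kernel tool: the function name decode_walk is made up for illustration, and it assumes the low bits of each printed value are the usual x86-64 page-table flag bits. Under that assumption, the upper levels are present while the final PTE is empty, which is consistent with the "unable to handle kernel paging request" message.

```python
import re

# Low-order x86-64 page-table entry flag bits (architectural meaning).
# Accessed/dirty are only meaningful for some levels; they are listed
# here purely to label the bits that are set in the printed values.
FLAGS = {
    0x001: "present",
    0x002: "writable",
    0x004: "user",
    0x020: "accessed",
    0x040: "dirty",
}


def decode_walk(line):
    """Map each level named in a page-table walk line to its set flag names."""
    result = {}
    for level, value in re.findall(r"\b(PGD|PUD|PMD|PTE) ([0-9a-f]+)", line):
        entry = int(value, 16)
        result[level] = [name for bit, name in FLAGS.items() if entry & bit]
    return result


if __name__ == "__main__":
    walk = decode_walk(
        "[12694.750690] PGD 1002067 PUD 1006067 PMD 3a9f067 PTE 0"
    )
    for level, flags in walk.items():
        print(level, flags or "not present")
```

An empty flag list for a level means its entry has the present bit clear, so the walk cannot be completed for the faulting address.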