On 07/18/2012 11:32 AM, David Sommerseth wrote:
On 18/07/12 18:35, Orion Poplawski wrote:
On 07/17/2012 11:22 AM, Orion Poplawski wrote:
Our SL6.2 KVM and nfs/backup server has been crashing frequently recently
(starting around Fri 13th - yikes!) with "Kernel panic - Out of memory and
no killable processes". The server has 48GB RAM, 2GB swap, and only about
15GB dedicated to VM guests. I've tried bumping up vm.min_free_kbytes to
262144 to no avail. Nothing strange is getting written to the logs before
the crash. Happening with both 2.6.32-220.23.1 and 2.6.32-279.1.1.
Anyone else seeing this? Any other ideas? I've set up a serial console
log to try to catch more information the next time it happens.
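For the record, the vm.min_free_kbytes bump described above can be applied and checked like this (a sketch; 262144 is the value from the report, not a recommendation):

```shell
# Raise the kernel's reserved-memory floor at runtime (takes effect immediately).
sysctl -w vm.min_free_kbytes=262144

# Verify the current value.
cat /proc/sys/vm/min_free_kbytes

# To make it persistent across reboots, add this line to /etc/sysctl.conf:
#   vm.min_free_kbytes = 262144
```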
here we go, see below. This makes no sense to me.
[<ffffffff811edc5d>] ? amiga_partition+0x6d/0x460
^^^^^^^^^^^^^^^
wtf!?! What kind of partition tables and file systems do you use? This
OOM kill seems to be caused by the amiga partition table code in the
kernel. It looks like it's some LVM command causing this to happen
somehow, though.
Well I bet it's just scanning all partition types and:
/boot/config-2.6.32-220.23.1.el6.x86_64:CONFIG_AMIGA_PARTITION=y
/boot/config-2.6.32-279.1.1.el6.x86_64:CONFIG_AMIGA_PARTITION=y
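The grep above generalizes: to see every partition-table parser compiled into the running kernel (a diagnostic sketch, config paths as on EL6):

```shell
# List all partition-table support options in the running kernel's config.
grep '_PARTITION=' /boot/config-$(uname -r)

# On EL6 these are built in (=y), so the kernel probes every block device
# for Amiga/Atari/Mac/etc. partition tables during a rescan - which is why
# amiga_partition() can show up in a stack trace triggered by an ordinary
# LVM or partition rescan.
```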
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
Now this is concerning ... you're out of swap, unless that's disabled.
I'm curious about this too. I have 8GB swap and top showed some being used.
  OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
4224700 4224638 99% 1.00K 1056175 4 4224700K ext4_inode_cache
This smells a bit bad ... ext4_inode_cache is using a lot of memory ...
3257480 3257186 99% 0.19K 162874 20 651496K dentry
1324786 1250981 94% 0.06K 22454 59 89816K size-64
484128 484094 99% 0.02K 3362 144 13448K avtab_node
347088 342539 98% 0.03K 3099 112 12396K size-32
342580 324110 94% 0.55K 48940 7 195760K radix_tree_node
236059 235736 99% 0.06K 4001 59 16004K ksm_rmap_item
123980 123566 99% 0.19K 6199 20 24796K size-192
105630 47803 45% 0.12K 3521 30 14084K size-128
24300 24261 99% 0.14K 900 27 3600K sysfs_dir_cache
17402 15599 89% 0.05K 226 77 904K anon_vma_chain
16055 14874 92% 0.20K 845 19 3380K vm_area_struct
9844 8471 86% 0.04K 107 92 428K anon_vma
8952 8775 98% 0.58K 1492 6 5968K inode_cache
7518 5829 77% 0.62K 1253 6 5012K proc_inode_cache
6840 4692 68% 0.19K 342 20 1368K filp
5888 5532 93% 0.04K 64 92 256K dm_io
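A quick way to check whether the ext4_inode_cache/dentry usage above is reclaimable (a diagnostic sketch, not a fix - these caches should normally be dropped automatically under memory pressure):

```shell
# Reclaimable vs. unreclaimable slab memory, from the kernel's view.
grep -E 'SReclaimable|SUnreclaim|Slab' /proc/meminfo

# Ask the kernel to drop clean dentries and inodes (non-destructive, but it
# discards warm caches; useful for diagnosis only, needs root).
sync
echo 2 > /proc/sys/vm/drop_caches

# Bias the VM toward reclaiming dentry/inode caches more aggressively
# (default is 100; higher values reclaim harder).
sysctl -w vm.vfs_cache_pressure=200
```

If the slab counts barely move after the drop, something is pinning those inodes, which would point at a leak rather than normal caching.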
top - 10:10:02 up 22:34, 4 users, load average: 1.02, 1.15, 1.53
Tasks: 888 total, 1 running, 887 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.8%us, 1.2%sy, 0.0%ni, 97.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49421492k total, 43619512k used, 5801980k free, 4409144k buffers
Swap: 8388600k total, 16308k used, 8372292k free, 25837164k cached
Somehow, this doesn't reflect what the kernel complains about when the
OOM killer starts its mission.
That's bugging me too.
I see that you're using kernel-2.6.32-279.1.1.el6.x86_64 ... that
smells a bit like an SL 6.3 beta ... is that right? SL 6.2 is usually
around 2.6.32-220-something. I would recommend trying a 6.2 kernel
rather than something more bleeding edge.
I was running 220-23.1 and it was crashing, so I tried the newer one to
see if that helped. I think I'll back off now.
And it somehow seems to be related to some file system issues ... at
least from what I can see. Could be a buggy kernel which leaks memory
somewhere in either the partition table code or the ext4 code paths.
One possibility perhaps. The machine comes up doing an md sync:
md1 : active raid10 sdh1[4] sdb2[0] sda2[1] sdd1[2] sde1[7] sdf1[6] sdc1[3]
sdg1[5]
3906203648 blocks 256K chunks 2 near-copies [8/8] [UUUUUUUU]
[=>...................] resync = 9.2% (362126208/3906203648)
finish=1334.6min speed=44255K/sec
I wonder if, when that resync completes, some kind of LVM device scan is
triggered which causes the problem. I'm not sure what fires off an lvm
process in the first place.
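If a post-resync LVM scan is the trigger, one way to narrow what LVM probes is a device filter in lvm.conf (a sketch; the regexes are hypothetical and must match your actual PV layout):

```shell
# /etc/lvm/lvm.conf - restrict which block devices LVM scans.
# devices {
#     # Accept only the md array that actually carries PVs, reject
#     # everything else (order matters; the first matching rule wins).
#     filter = [ "a|^/dev/md1$|", "r|.*|" ]
# }

# Then verify what LVM would actually scan:
lvm pvscan -vv 2>&1 | head
```

That at least keeps LVM from poking every raw disk and partition when the array comes back, which would also keep the kernel's partition-type probing (amiga_partition and friends) out of the picture.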
--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA, Boulder Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.nwra.com