On 07/18/2012 11:32 AM, David Sommerseth wrote:
On 18/07/12 18:35, Orion Poplawski wrote:
On 07/17/2012 11:22 AM, Orion Poplawski wrote:
Our SL6.2 KVM and nfs/backup server has been crashing frequently recently
(starting around Fri 13th - yikes!) with "Kernel panic - Out of memory and
no killable processes".  The server has 48GB ram, 2GB swap, only about 15GB
dedicated to VM guests.  I've tried bumping up vm.min_free_kbytes to 262144
to no avail.  Nothing strange is getting written to the logs before the
crash.
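For the record, the bump was done the usual sysctl way (a sketch; 262144 kB is just the value tried here, not a recommendation):

```shell
# Current emergency free-memory reserve (kB); readable by any user:
cat /proc/sys/vm/min_free_kbytes

# Raise it (root only); 262144 kB = 256MB was the value tried here:
# sysctl -w vm.min_free_kbytes=262144
# and to persist it across reboots:
# echo 'vm.min_free_kbytes = 262144' >> /etc/sysctl.conf
```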

Happening with both 2.6.32-220.23.1 and 2.6.32-279.1.1.

Anyone else seeing this?  Any other ideas?  I've set up a serial console
log to try to catch more information the next time it happens.


here we go, see below.  This makes no sense to me.

  [<ffffffff811edc5d>] ? amiga_partition+0x6d/0x460
                           ^^^^^^^^^^^^^^^
wtf!?!  What kind of partition tables and file systems do you use?  This
OOM kill seems to be caused by the amiga partition table code in the
kernel.  It looks like it's some LVM command causing this to happen
somehow, though.

Well I bet it's just scanning all partition types and:

/boot/config-2.6.32-220.23.1.el6.x86_64:CONFIG_AMIGA_PARTITION=y
/boot/config-2.6.32-279.1.1.el6.x86_64:CONFIG_AMIGA_PARTITION=y


0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 0kB
Total swap = 0kB

Now this is concerning ... you're out of swap, unless that's disabled.

I'm curious about this too.  I have 8GB swap and top showed some being used.

4224700 4224638  99%    1.00K 1056175        4   4224700K ext4_inode_cache

This smells a bit bad ... ext4_inode_cache is using a lot of memory ...


3257480 3257186  99%    0.19K 162874       20    651496K dentry
1324786 1250981  94%    0.06K  22454       59     89816K size-64
484128 484094  99%    0.02K   3362      144     13448K avtab_node
347088 342539  98%    0.03K   3099      112     12396K size-32
342580 324110  94%    0.55K  48940        7    195760K radix_tree_node
236059 235736  99%    0.06K   4001       59     16004K ksm_rmap_item
123980 123566  99%    0.19K   6199       20     24796K size-192
105630  47803  45%    0.12K   3521       30     14084K size-128
  24300  24261  99%    0.14K    900       27      3600K sysfs_dir_cache
  17402  15599  89%    0.05K    226       77       904K anon_vma_chain
  16055  14874  92%    0.20K    845       19      3380K vm_area_struct
   9844   8471  86%    0.04K    107       92       428K anon_vma
   8952   8775  98%    0.58K   1492        6      5968K inode_cache
   7518   5829  77%    0.62K   1253        6      5012K proc_inode_cache
   6840   4692  68%    0.19K    342       20      1368K filp
   5888   5532  93%    0.04K     64       92       256K dm_io
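One way to keep an eye on whether that slab growth is reclaimable (a sketch using /proc/meminfo; the drop_caches step needs root and only frees reclaimable caches, so a genuine leak would survive it):

```shell
#!/bin/sh
# Slab totals from /proc/meminfo.  SReclaimable covers caches like
# ext4_inode_cache and dentry; a steadily climbing SUnreclaim is the
# stronger hint of a real kernel leak.
grep -E '^(Slab|SReclaimable|SUnreclaim):' /proc/meminfo

# Root only: ask the kernel to drop dentry/inode caches.  If
# ext4_inode_cache barely shrinks afterwards, something is pinning
# those inodes.
# sync; echo 2 > /proc/sys/vm/drop_caches
```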


top - 10:10:02 up 22:34,  4 users,  load average: 1.02, 1.15, 1.53
Tasks: 888 total,   1 running, 887 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.8%us,  1.2%sy,  0.0%ni, 97.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49421492k total, 43619512k used,  5801980k free,  4409144k buffers
Swap:  8388600k total,    16308k used,  8372292k free, 25837164k cached

Somehow, this doesn't reflect what the kernel complains about when the
OOM killer starts its mission.

That's bugging me too.

I see that you're using kernel-2.6.32-279.1.1.el6.x86_64 ... that
smells a bit like an SL 6.3 beta ... is that right?  SL 6.2 is usually
around 2.6.32-220-something.  I would probably recommend you try a
6.2 kernel if you're running something much more bleeding edge.

I was running 220-23.1 and it was crashing, so I tried the newer one to see if that helped.  I think I'll back off now.

And it somehow seems to be related to some file system issues ... at
least from what I can see.  Could be a buggy kernel which leaks memory
somewhere in either the partition table code or the ext4 code paths.

One possibility, perhaps: the machine comes up doing an md sync:

md1 : active raid10 sdh1[4] sdb2[0] sda2[1] sdd1[2] sde1[7] sdf1[6] sdc1[3] sdg1[5]
      3906203648 blocks 256K chunks 2 near-copies [8/8] [UUUUUUUU]
[=>...................] resync = 9.2% (362126208/3906203648) finish=1334.6min speed=44255K/sec

I wonder if, when that completes, some kind of LVM device scan is triggered which causes the problem.  I'm not sure what fires off an lvm process in the first place.
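If that theory holds, one mitigation to test (my assumption, not a confirmed fix; the device name is this box's md1) would be to stop LVM from probing the raw md member disks at all, via the device filter in /etc/lvm/lvm.conf:

```
# /etc/lvm/lvm.conf (sketch): accept the md array itself, reject
# everything else so vgscan never touches the underlying sd* members.
devices {
    filter = [ "a|^/dev/md1$|", "r|.*|" ]
}
```

After editing, `vgscan` (and on newer setups rebuilding the initramfs) is needed before the filter takes full effect.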

--
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder Office                  FAX: 303-415-9702
3380 Mitchell Lane                       [email protected]
Boulder, CO 80301                   http://www.nwra.com
