On 03.03.2015 14:18, Gerhard Wiesinger wrote:
On 03.03.2015 13:28, Gerhard Wiesinger wrote:
On 03.03.2015 10:12, Gerhard Wiesinger wrote:
On 02.03.2015 18:15, Gerhard Wiesinger wrote:
On 02.03.2015 16:52, Gerhard Wiesinger wrote:
On 02.03.2015 10:26, Paolo Bonzini wrote:
On 01/03/2015 11:36, Gerhard Wiesinger wrote:
So far it has happened only on the PostgreSQL database VM. The kernel is alive
(ping works well), but ssh is not working.
Console window: after entering one character at the login prompt, it crashed:
[1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or sacrifice child
[1438.384990] Killed process 10115 (pg_dump) total-vm:340548kB, anon-rss:162712kB, file-rss:220kB
Can you get a vmcore or at least sysrq-t output?
Yes, next time it happens I can analyze it.
I think there are 2 problems:
1.) An OOM (Out of Memory) problem with the low memory settings and
kernel settings (see below)
2.) An instability problem which might depend on 1.)
What I've done so far (thanks to Andrey Korolyov for ideas and help):
a.) Updated machine type from pc-0.15 to pc-i440fx-2.2
virsh dumpxml database | grep "<type"
<type arch='x86_64' machine='pc-0.15'>hvm</type>
virsh edit database
virsh dumpxml database | grep "<type"
<type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
SMBIOS is therefore updated from 2.4 to 2.8:
dmesg|grep -i SMBIOS
[ 0.000000] SMBIOS 2.8 present.
b.) Switched to tsc clock, kernel parameters: clocksource=tsc
nohz=off highres=off
c.) Changed overcommit to 1
echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
d.) Tried 1 VCPU instead of 2
e.) Installed 512MB vRAM instead of 384MB
f.) Prepared for sysrq and vmcore
echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
sysctl -w kernel.sysrq=1
virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
virsh dump domain-name /tmp/dumpfile
g.) Further ideas, not yet done: disable memory ballooning by
blacklisting the balloon driver or removing it from the virsh XML config
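The rough plan for g.) would be something like the following (untested sketch,
not applied yet; the modprobe.d file name is just an example):
echo "blacklist virtio_balloon" > /etc/modprobe.d/blacklist-virtio-balloon.conf
# or alternatively drop the balloon device from the domain definition
# (virsh edit database) by setting:
#   <memballoon model='none'/>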
Summary:
1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM problem, no crash
2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem, no crash
3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem, no crash
3b.) It happened again during the nightly backup with the same configuration
as in 3.) (384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, pc-i440fx-2.2):
no OOM problem, ping OK, but no other reaction, and it CRASHED again
3c.) Configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1,
pc-i440fx-2.2: OOM problem, no crash
postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Free swap = 905924kB
Total swap = 1081340kB
Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB, file-rss:0kB
An OOM should not occur:
https://www.kernel.org/doc/gorman/html/understand/understand016.html
Is there enough swap space left (nr_swap_pages > 0)? If yes, not OOM.
Why does an OOM condition occur anyway? Looks like a bug in the kernel?
Any ideas?
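One more thing that could be captured right after the next OOM (just a sketch
of what I have in mind, not done yet; nothing assumed beyond a standard guest):
# full OOM report from the kernel ring buffer (it also dumps per-zone state)
dmesg | grep -B 5 -A 40 'invoked oom-killer'
# free pages vs. the min/low/high watermarks per zone
grep -B 1 -A 3 'pages free' /proc/zoneinfo
# relevant VM tunables and commit/swap counters
sysctl vm.min_free_kbytes vm.swappiness vm.overcommit_memory
grep -E 'Commit|Swap' /proc/meminfo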
# Allocating 800MB, killed by OOM killer
./mallocsleep 805306368
Killed
Out of memory: Kill process 27160 (mallocsleep) score 525 or sacrifice child
Killed process 27160 (mallocsleep) total-vm:790588kB, anon-rss:214948kB, file-rss:0kB
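(mallocsleep is just a small local test program and is not attached; assuming
it only allocates the given number of bytes, touches them and then sleeps,
roughly the same test can be reproduced with a one-liner, e.g.:)
# allocate and touch ~768MB (805306368 bytes), then hold it for an hour
perl -e '$x = "x" x 805306368; sleep 3600'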
free -m
              total        used        free      shared  buff/cache   available
Mem:            363          23         252          23          87         295
Swap:          1055         134         921
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1392
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1392
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
# Machine is getting unresponsive and stalls for seconds, but never
reaches more than 1055MB swap size (+ 384MB RAM)
vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 136472 241196 1400 98544 4 57 1724 67 211 261 2 3 91 2 2
0 0 136472 241228 1400 98540 0 0 0 0 30 48 0 0 100 0 0
0 0 136472 241228 1408 98532 0 0 0 52 53 51 0 0 89 11 0
0 0 136472 241224 1408 98540 0 0 0 112 44 92 0 0 100 0 0
0 0 136472 241224 1408 98540 0 0 0 0 24 32 0 0 100 0 0
0 0 136472 241352 1408 98540 0 0 0 0 31 44 0 1 100 0 0
0 0 136472 241328 1408 98540 0 0 0 36 97 142 0 1 99 0 0
0 0 136472 241364 1408 98540 0 0 0 0 22 30 0 0 100 0 0
0 0 136472 241376 1416 98532 0 0 0 80 52 45 0 0 92 8 1
1 0 136472 9236 1416 98548 0 0 8 0 762 55 11 23 66 0 0
2 7 270496 3804 140 61172 1144 412268 15028 412340 92805 301836 1 49 1 27 22
1 12 620320 4788 140 35240 1240 114864 96860 114976 46242 96395 1 26 0 61 12
3 18 661436 4788 144 35568 508 0 167884 0 5605 8097 5 76 0 16 4
3 4 661220 4288 144 34256 252 0 273684 0 7454 9777 3 71 0 19 7
5 20 661024 4532 144 34772 320 0 238288 0 9452 12395 3 78 0 13 6
6 19 660596 4592 144 35884 320 0 233160 8 12401 16798 5 67 0 12 15
3 20 677268 4296 140 36816 2180 18200 444328 18332 19382 36234 8 67 0 11 14
3 25 677208 4792 136 36044 68 0 524340 12 20637 26558 3 74 0 15 8
2 21 687880 4964 136 38200 260 10784 311152 10884 17707 28941 4 78 0 12 5
3 21 693808 4380 176 36860 136 6024 388932 6096 14576 22372 3 84 0 6 7
3 27 693740 4432 152 38288 56 20736 419592 20744 23212 31219 4 87 0 7 2
3 23 713696 4384 152 38172 796 0 481420 96 16498 27177 8 87 0 4 1
3 27 713360 4116 152 38372 1844 0 1308552 296 25074 33901 5 85 0 9 1
3 29 714628 4416 180 41992 256 2556 501832 2704 56498 76293 3 91 0 5 1
3 29 714572 3860 172 41076 156 0 920736 152 12131 17339 5 94 0 0 0
4 28 714396 5108 152 40124 212 10924 567648 11148 41901 56712 4 90 0 4 2
3 30 725216 4060 136 40604 124 0 286384 156 21992 35505 5 91 0 2 3
8 12 148836 230388 320 70888 5356 0 24304 52 9977 15084 17 75 0 5 3
0 0 146692 271900 416 76680 2200 0 6592 0 1561 3198 10 10 78 2 1
0 0 146584 271900 416 76892 152 0 184 0 75 139 0 0 100 0 1
0 0 146488 271396 552 76980 128 0 264 36 124 230 0 1 98 1 0
0 0 146372 271076 680 77196 124 0 252 8 79 167 0 0 100 0 0
0 0 146312 270948 688 77332 64 0 64 80 61 102 0 0 97 3 1
What's wrong here?
Kernel Bug?
It all reminds me of the post here:
http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html
Last month, these outages began to happen more regularly but also very
randomly. The symptoms were quite similar:
CPU spiked to 100% utilization.
Disk I/O spiked.
Server became completely inaccessible via SSH, etc.
Logs show the Linux Out Of Memory (OOM) killer killing user
processes that have hit their cgroup's memory limit shortly before the
server froze.
Host memory was not under pressure - it was close to fully utilized
(which is normal) but there was a lot of unused swap.
Ciao,
Gerhard