Hi,
Sorry to bother you again. I have more info:
> 1. router with 32MB of RAM (hugepages) and 1VCPU
...
> Is it too much to have 3 guests with hugepages?
OK, this router is now also out of the equation - I have disabled
hugepages for it. That should also leave a few additional hugepages
available to the guests. I think this should be pretty reproducible:
two identical 64-bit Linux 2.6.32 guests with 3500MB of virtual RAM
and 4 VCPUs each, running on a Core2Quad (4 real cores) machine with
8GB of RAM and 3546 2MB hugepages, on a 64-bit Linux 2.6.35 host
(libvirt 0.8.3) from Ubuntu Maverick.
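
Just to spell out the hugepage arithmetic I am relying on (standard
/proc/meminfo fields - the figures below are just the expected values
for the setup described above, not a capture from the box):

# grep Huge /proc/meminfo
HugePages_Total:    3546
Hugepagesize:       2048 kB

Each big guest needs 3500MB / 2MB = 1750 pages, so the two of them
together take 2 x 1750 = 3500 of the 3546 reserved pages.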
Still no swapping, and the effect is pretty much the same: one guest
runs well, but with two guests everything works for some minutes and
then slows down by a few hundred times, showing huge load both inside
the guests (unbounded, rapid growth of the load average) and outside
(the host load does not make it unresponsive, though - but the CPUs
are loaded to the max). The load growth on the host is instant and
bounded (the change in the 'r' column shows the sudden rise):
# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 3 0 194220 30680 76712 0 0 319 28 2633 1960 6 6 67 20
1 2 0 193776 30680 76712 0 0 4 231 55081 78491 3 39 17 41
10 1 0 185508 30680 76712 0 0 4 87 53042 34212 55 27 9 9
12 0 0 185180 30680 76712 0 0 2 95 41007 21990 84 16 0 0
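
In case it helps, the hugepage backing is set up the usual way - this
is just a sketch of the standard knobs, not my exact command history:

# echo 3546 > /proc/sys/vm/nr_hugepages
# mount -t hugetlbfs hugetlbfs /dev/hugepages

libvirt then starts each big guest with hugepage-backed memory, which
ends up as qemu-kvm being invoked with something like

  qemu-kvm -m 3500 -smp 4 -mem-path /dev/hugepages ...

(with -mem-path, preallocation of the hugepages is the default).
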
Thanks,
Dmitry
On Wed, Nov 17, 2010 at 4:19 AM, Dmitry Golubev <[email protected]> wrote:
> Hi,
>
> Maybe you remember that I wrote a few weeks ago about a KVM CPU load
> problem with hugepages. The problem was left hanging, but I now have
> some new information. The description remains the same, except that I
> have decreased both the guest memory and the number of hugepages:
>
> RAM = 8GB, hugepages = 3546
>
> A total of 2 virtual machines:
> 1. router with 32MB of RAM (hugepages) and 1VCPU
> 2. linux guest with 3500MB of RAM (hugepages) and 4VCPU
>
> Everything works fine until I start the second Linux guest, with the
> same 3500MB of guest RAM, also in hugepages, and also 4 VCPUs. The
> rest of the description is the same as before: after a while the host
> shows a load average of about 8 (on a Core2Quad), and it seems that
> both big guests consume exactly the same amount of resources. The
> host still seems responsive, though. Inside the guests, however,
> things are not so good - the load skyrockets to at least 20. The
> guests are not responsive, and even a 'ps' takes unreasonably long
> (it may take a few minutes; here the load builds up and the machine
> seems to become slower over time, unlike the host, which shows the
> jump in resource consumption instantly). It also seems that the more
> memory the guests use, the faster the problem appears. Still, at
> least a gig of RAM is free in each guest and there is no swap
> activity inside the guests.
>
> The most important thing - and the reason I went back and quoted an
> older message than the last one - is that there is no more swap
> activity on the host, so the previous line of thought may also have
> been wrong and I am back at the beginning. There is plenty of RAM
> now, and swap on the host always stays at 0 as seen in 'top'. And
> there is 100% CPU load, shared equally between the two large guests.
> To stop the load I can destroy either large guest. Additionally, I
> have just discovered that suspending either large guest works as
> well; moreover, after a resume the load does not come back for a
> while. Both methods stop the high load instantly (in under a second;
> the corresponding virsh commands are sketched just after the 'top'
> listing below). As you were asking for a 'top' inside the guest, here
> it is:
>
> top - 03:27:27 up 42 min, 1 user, load average: 18.37, 7.68, 3.12
> Tasks: 197 total, 23 running, 174 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.0%us, 89.2%sy, 0.0%ni, 10.5%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
> Mem: 3510912k total, 1159760k used, 2351152k free, 62568k buffers
> Swap: 4194296k total, 0k used, 4194296k free, 484492k cached
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 12303 root 20 0 0 0 0 R 100 0.0 0:33.72 vpsnetclean
> 11772 99 20 0 149m 11m 2104 R 82 0.3 0:15.10 httpd
> 10906 99 20 0 149m 11m 2124 R 73 0.3 0:11.52 httpd
> 10247 99 20 0 149m 11m 2128 R 31 0.3 0:05.39 httpd
> 3916 root 20 0 86468 11m 1476 R 16 0.3 0:15.14 cpsrvd-ssl
> 10919 99 20 0 149m 11m 2124 R 8 0.3 0:03.43 httpd
> 11296 99 20 0 149m 11m 2112 R 7 0.3 0:03.26 httpd
> 12265 99 20 0 149m 11m 2088 R 7 0.3 0:08.01 httpd
> 12317 root 20 0 99.6m 1384 716 R 7 0.0 0:06.57 crond
> 12326 503 20 0 8872 96 72 R 7 0.0 0:01.13 php
> 3634 root 20 0 74804 1176 596 R 6 0.0 0:12.15 crond
> 11864 32005 20 0 87224 13m 2528 R 6 0.4 0:30.84 cpsrvd-ssl
> 12275 root 20 0 30628 9976 1364 R 6 0.3 0:24.68 cpgs_chk
> 11305 99 20 0 149m 11m 2104 R 6 0.3 0:02.53 httpd
> 12278 root 20 0 8808 1328 968 R 6 0.0 0:04.63 sim
> 1534 root 20 0 0 0 0 S 6 0.0 0:03.29 flush-254:2
> 3626 root 20 0 149m 13m 5324 R 6 0.4 0:27.62 httpd
> 12279 32008 20 0 87472 7668 2480 R 6 0.2 0:27.63 munin-update
> 10243 99 20 0 149m 11m 2128 R 5 0.3 0:08.47 httpd
> 12321 root 20 0 99.6m 1460 792 R 5 0.0 0:07.43 crond
> 12325 root 20 0 74804 672 92 R 5 0.0 0:00.76 crond
> 1531 root 20 0 0 0 0 S 2 0.0 0:02.26 kjournald
> 1 root 20 0 10316 756 620 S 0 0.0 0:02.10 init
> 2 root 20 0 0 0 0 S 0 0.0 0:00.01 kthreadd
> 3 root RT 0 0 0 0 S 0 0.0 0:01.08 migration/0
> 4 root 20 0 0 0 0 S 0 0.0 0:00.02 ksoftirqd/0
> 5 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
> 6 root RT 0 0 0 0 S 0 0.0 0:00.47 migration/1
> 7 root 20 0 0 0 0 S 0 0.0 0:00.03 ksoftirqd/1
> 8 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/1
>
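> For reference, the workaround commands are just the standard virsh
> operations (the domain names below are placeholders, not the real
> ones):
>
> # virsh suspend big-guest-1    <- the load drops in under a second
> # virsh resume big-guest-1     <- the load stays away for a while
> # virsh destroy big-guest-1    <- also clears the load instantly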
>
> The tasks keep changing in the 'top' view, so it is nothing like a
> single task hanging - it looks more like a machine working out of
> swap. The problem, however, is that according to vmstat there is no
> swap activity during this time. Should I try to decrease the RAM I
> give to my guests even more? Is it too much to have 3 guests with
> hugepages? Should I try something else? Unfortunately it is a
> production system, so I cannot experiment with it very much.
>
> Here is 'top' on the host:
>
> top - 03:32:12 up 25 days, 23:38, 2 users, load average: 8.50, 5.07, 10.39
> Tasks: 133 total, 1 running, 132 sleeping, 0 stopped, 0 zombie
> Cpu(s): 99.1%us, 0.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
> Mem: 8193472k total, 8071776k used, 121696k free, 45296k buffers
> Swap: 11716412k total, 0k used, 11714844k free, 197236k cached
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 8426 libvirt- 20 0 3771m 27m 3904 S 199 0.3 10:28.33 kvm
> 8374 libvirt- 20 0 3815m 32m 3908 S 199 0.4 8:11.53 kvm
> 1557 libvirt- 20 0 225m 7720 2092 S 1 0.1 436:54.45 kvm
> 72 root 20 0 0 0 0 S 0 0.0 6:22.54 kondemand/3
> 379 root 20 0 0 0 0 S 0 0.0 58:20.99 md3_raid5
> 1 root 20 0 23768 1944 1228 S 0 0.0 0:00.95 init
> 2 root 20 0 0 0 0 S 0 0.0 0:00.24 kthreadd
> 3 root 20 0 0 0 0 S 0 0.0 0:12.66 ksoftirqd/0
> 4 root RT 0 0 0 0 S 0 0.0 0:07.58 migration/0
> 5 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
> 6 root RT 0 0 0 0 S 0 0.0 0:15.05 migration/1
> 7 root 20 0 0 0 0 S 0 0.0 0:19.64 ksoftirqd/1
> 8 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/1
> 9 root RT 0 0 0 0 S 0 0.0 0:07.21 migration/2
> 10 root 20 0 0 0 0 S 0 0.0 0:41.74 ksoftirqd/2
> 11 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/2
> 12 root RT 0 0 0 0 S 0 0.0 0:13.62 migration/3
> 13 root 20 0 0 0 0 S 0 0.0 0:24.63 ksoftirqd/3
> 14 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/3
> 15 root 20 0 0 0 0 S 0 0.0 1:17.11 events/0
> 16 root 20 0 0 0 0 S 0 0.0 1:33.30 events/1
> 17 root 20 0 0 0 0 S 0 0.0 4:15.28 events/2
> 18 root 20 0 0 0 0 S 0 0.0 1:13.49 events/3
> 19 root 20 0 0 0 0 S 0 0.0 0:00.00 cpuset
> 20 root 20 0 0 0 0 S 0 0.0 0:00.02 khelper
> 21 root 20 0 0 0 0 S 0 0.0 0:00.00 netns
> 22 root 20 0 0 0 0 S 0 0.0 0:00.00 async/mgr
> 23 root 20 0 0 0 0 S 0 0.0 0:00.00 pm
> 25 root 20 0 0 0 0 S 0 0.0 0:02.47 sync_supers
> 26 root 20 0 0 0 0 S 0 0.0 0:03.86 bdi-default
>
>
> Please help...
>
> Thanks,
> Dmitry
>
> On Sat, Oct 2, 2010 at 1:30 AM, Marcelo Tosatti <[email protected]> wrote:
>>
>> On Thu, Sep 30, 2010 at 12:07:15PM +0300, Dmitry Golubev wrote:
>> > Hi,
>> >
>> > I am not sure what is really happening, but every few hours
>> > (unpredictably) two virtual machines (Linux 2.6.32) start to
>> > generate huge CPU loads. It looks like some kind of loop is unable
>> > to complete or something...
>> >
>> > So the idea is:
>> >
>> > 1. I have two Linux 2.6.32 x64 (OpenVZ, Proxmox project) guests
>> > running on a Linux 2.6.35 x64 (Ubuntu Maverick) host with a Q6600
>> > Core2Quad, on qemu-kvm 0.12.5 and libvirt 0.8.3, plus one more
>> > small 32-bit Linux virtual machine (16MB of RAM) with a router
>> > inside (I doubt it contributes to the problem).
>> >
>> > 2. All these machines use hugetlbfs. The server has 8GB of RAM; I
>> > reserved 3696 huge pages (page size 2MB) on it, and the two main
>> > guests each have 3550MB of virtual memory. The third guest, as I
>> > wrote before, takes 16MB of virtual memory.
>> >
>> > 3. Once started, the guests reserve huge pages for themselves
>> > normally. As mem-prealloc is the default, they grab all the memory
>> > they should have, leaving 6 pages unreserved (HugePages_Free -
>> > HugePages_Rsvd = 6) at all times - so as I understand it, they
>> > should not want to take any more, right?
>> >
>> > 4. All virtual machines run perfectly normally, without any
>> > disturbances, for a few hours. They do not, however, use all their
>> > memory, so maybe the issue arises when they pass some kind of
>> > threshold.
>> >
>> > 5. At some point both guests exhibit CPU load through the roof
>> > (16-24). At the same time, the host works perfectly well, showing
>> > a load of 8 and that both kvm processes use the CPU equally and
>> > fully. This point in time is unpredictable - it can be anything
>> > from one to twenty hours, but it is always less than a day.
>> > Sometimes the load disappears in a moment, but usually it stays
>> > like that, and everything works extremely slowly (even a 'ps'
>> > command takes some 2-5 minutes).
>> >
>> > 6. If I am patient, I can start rebooting the guest systems - once
>> > they have restarted, everything returns to normal. If I destroy
>> > one of the guests (virsh destroy), the other one starts working
>> > normally at once (!).
>> >
>> > I am relatively new to KVM and I am absolutely lost here. I have
>> > not experienced such problems before, but recently I upgraded from
>> > Ubuntu Lucid (I think it was Linux 2.6.32, qemu-kvm 0.12.3 and
>> > libvirt 0.7.5) and started to use hugepages. These two virtual
>> > machines do not normally run on the same host system (I have a
>> > corosync/pacemaker cluster with DRBD storage), but when one of the
>> > hosts is unavailable they end up running on the same host. That is
>> > why I had not noticed this earlier.
>> >
>> > Unfortunately, I don't have any spare hardware to experiment with,
>> > and this is a production system, so my debugging options are
>> > rather limited.
>> >
>> > Do you have any idea what could be wrong?
>>
>> Is there swapping activity on the host when this happens?
>>
>