Hi,

On 17.09.2018 at 08:38, Jack Wang wrote:
> Stefan Priebe - Profihost AG <s.pri...@profihost.ag> wrote on Sun, 16 Sep 2018 at 3:31 PM:
>>
>> Hello,
>>
>> while overcommitting CPU I had several situations where all VMs went offline
>> while two VMs saturated all cores.
>>
>> I believed all VMs would stay online but would just not be able to use all
>> their cores?
>>
>> My original idea was to automate live migration on high host load to move
>> VMs to another node, but that only makes sense if all VMs stay online.
>>
>> Is this expected? Anything special needed to achieve this?
>>
>> Greets,
>> Stefan
>>
> Hi, Stefan,
>
> Do you have any logs from when all VMs go offline?
> Maybe the OOM killer plays a role there?
After reviewing, I think this is memory related, but the OOM killer did not play a role. All kvm processes were spinning, trying to get > 100% CPU, and I was not even able to log in via ssh. After 5-10 minutes I was able to log in.

There was about 150 GB of free memory.

Relevant settings (no local storage involved):

vm.dirty_background_ratio: 3
vm.dirty_ratio: 10
vm.min_free_kbytes: 10567004

# cat /sys/kernel/mm/transparent_hugepage/defrag
always defer [defer+madvise] madvise never

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

After that I had the following traces on the host node:
https://pastebin.com/raw/0VhyQmAv

Thanks!

Greets,
Stefan
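
P.S. In case someone wants to compare these settings on another host, here is a minimal sketch that reads the same values. The helper names (active_thp_mode, kib_to_gib) are made up for illustration and not part of any tool; it only assumes the usual sysfs/procfs paths quoted above and that the active THP choice is the one shown in square brackets.

```python
import re
from pathlib import Path


def active_thp_mode(text):
    """Return the bracketed (currently active) value from a THP sysfs line.

    Hypothetical helper: e.g. "[always] madvise never" gives "always".
    Returns None if no bracketed value is present.
    """
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def kib_to_gib(kib):
    """Convert a value in KiB (as used by vm.min_free_kbytes) to GiB."""
    return kib / (1024 * 1024)


if __name__ == "__main__":
    # Paths as quoted in this mail; skipped silently if not present.
    thp_dir = Path("/sys/kernel/mm/transparent_hugepage")
    for name in ("enabled", "defrag"):
        path = thp_dir / name
        if path.exists():
            print(f"THP {name}: {active_thp_mode(path.read_text())}")

    min_free = Path("/proc/sys/vm/min_free_kbytes")
    if min_free.exists():
        kib = int(min_free.read_text().strip())
        print(f"vm.min_free_kbytes: {kib} (~{kib_to_gib(kib):.2f} GiB)")
```

With the values from this host, vm.min_free_kbytes of 10567004 works out to roughly 10 GiB reserved, and the active THP modes are "always" for enabled and "defer+madvise" for defrag.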