Stefan Priebe - Profihost AG <s.pri...@profihost.ag> wrote on Mon, Sep 17, 2018 at 9:00 AM:
>
> Hi,
>
> On 17.09.2018 at 08:38, Jack Wang wrote:
> > Stefan Priebe - Profihost AG <s.pri...@profihost.ag> wrote on Sun, Sep 16, 2018 at 3:31 PM:
> >>
> >> Hello,
> >>
> >> While overcommitting CPU I had several situations where all VMs went
> >> offline while two VMs saturated all cores.
> >>
> >> I believed all VMs would stay online but would just not be able to use
> >> all of their cores?
> >>
> >> My original idea was to automate live migration on high host load to
> >> move VMs to another node, but that only makes sense if all VMs stay
> >> online.
> >>
> >> Is this expected? Is anything special needed to achieve this?
> >>
> >> Greets,
> >> Stefan
> >>
> > Hi Stefan,
> >
> > Do you have any logs from when all VMs went offline?
> > Maybe the OOM killer played a role there?
>
> After reviewing, I think this is memory related, but OOM did not play a
> role. All kvm processes were spinning trying to get > 100% CPU, and I was
> not able to even log in via ssh. After 5-10 minutes I was able to log in.

So the VMs are not really offline. What is the result if you run
query-status via QMP? (A minimal example is sketched after the quoted
thread below.)

> There were about 150 GB of free memory.
>
> Relevant settings (no local storage involved):
> vm.dirty_background_ratio: 3
> vm.dirty_ratio: 10
> vm.min_free_kbytes: 10567004
>
> # cat /sys/kernel/mm/transparent_hugepage/defrag
> always defer [defer+madvise] madvise never
>
> # cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
>
> After that I had the following traces on the host node:
> https://pastebin.com/raw/0VhyQmAv
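To check whether the guests are actually still running while the host is
under that load, you can ask each QEMU process directly over its QMP
monitor socket. This is only a rough sketch, assuming the monitor is
exposed as a UNIX socket at /var/run/qemu/<vmid>.qmp (that path is a
guess; it depends on how your management stack starts QEMU):

  # QMP requires the capabilities handshake before any other command
  echo '{"execute":"qmp_capabilities"} {"execute":"query-status"}' \
      | socat -t 2 - UNIX-CONNECT:/var/run/qemu/<vmid>.qmp

If the reply contains "status": "running", the guests were only starved
for CPU time rather than actually offline, which would match the load you
describe.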
The call trace looks like a Ceph-related deadlock or hung task; a few
generic checks to narrow that down are sketched below.

> Thanks!
>
> Greets,
> Stefan
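If something on the Ceph path is blocking, the kernel's hung task detector
normally reports the stuck tasks, and processes sitting in uninterruptible
sleep can be listed directly. These are generic checks, nothing specific
to your setup:

  # hung task reports from khungtaskd ("blocked for more than N seconds")
  dmesg | grep -i "blocked for more than"

  # processes stuck in D state, with the kernel function they are waiting in
  ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

  # kernel stack of one stuck process (replace <pid>)
  cat /proc/<pid>/stack

Comparing that output with the pastebin traces should show whether the kvm
processes are really blocked in the kernel or just runnable and starved
for CPU.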