Hi, sorry for posting late. I had a brief look at this yesterday:

1. I couldn't replicate it locally, which suggests it is most likely due to a
   recent change.
2. I looked at the libvirt XMLs Lago generated for the hosts, as a new
   version (0.40) is used this week, and they seem OK, specifically memory
   and vCPUs (which were my initial suspects).
3. I saw two Engine patches, from shortly before it started to fail, which
   *might* be related as far as I can tell, but it is outside my scope to
   judge (CC'ed patch owners):
   core: Make VmAnalyzer to treat a migrated Paused VM as success -
   https://gerrit.ovirt.org/78305
   fix custom fencing default config setting -
   https://gerrit.ovirt.org/78720

A shot in the dark: could it be that the 'CPUOverloaded' filter was not
active before for some reason?

Also, there are some exceptions in the host0 vdsm log [1], failing to get VM
stats, though I can't tell if they are specific to this failure.

Of course this is not a complete analysis; I hope it helps.

[1] http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host0/_var_log/vdsm/vdsm.log

Nadav.

On Tue, Jul 4, 2017 at 12:46 PM, Eyal Edri <[email protected]> wrote:
>
>
> On Tue, Jul 4, 2017 at 12:18 PM, Michal Skrivanek
> <[email protected]> wrote:
>>
>>
>> On 3 Jul 2017, at 15:35, Shlomo Ben David <[email protected]> wrote:
>>
>> Hi,
>>
>> Test failed: [ 006_migrations.migrate_vm ]
>> Link to suspected patches: N/A
>> Link to Job:
>> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/
>> Link to all logs:
>> Error snippet from the log:
>> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/
>>
>> <error>
>>
>> "Fault reason is "Operation Failed". Fault detail is "[Cannot migrate VM.
>> There is no host that satisfies current scheduling constraints.
>> See below for details:, The host lago-basic-suite-master-host0 did not satisfy
>> internal filter CPUOverloaded because its CPU is too loaded.]"
>>
>> </error>
>>
>> <engine log>
>>
>> 2017-07-02 16:43:22,829-04 INFO
>> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock Acquired to object
>> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]',
>> sharedLocks=''}'
>> 2017-07-02 16:43:22,833-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored
>> procedure. Call string is [{call getdiskvmelementspluggedtovm(?)}]
>> 2017-07-02 16:43:22,833-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for
>> procedure [GetDiskVmElementsPluggedToVm] compiled
>> 2017-07-02 16:43:22,843-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored
>> procedure.
>> Call string is [{call getattacheddisksnapshotstovm(?, ?)}]
>> 2017-07-02 16:43:22,843-04 DEBUG
>> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall]
>> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for
>> procedure [GetAttachedDiskSnapshotsToVm] compiled
>> 2017-07-02 16:43:22,919-04 INFO
>> [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Candidate host
>> 'lago-basic-suite-master-host0' ('46bdc63d-98f5-4eee-81aa-2fb88b8f7cbe') was
>> filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'CPUOverloaded'
>> (correlation id: null)
>> 2017-07-02 16:43:22,920-04 WARN
>> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Validation of action
>> 'MigrateVmToServer' failed for user admin@internal-authz. Reasons:
>> VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
>> lago-basic-suite-master-host0,$filterName
>> CPUOverloaded,VAR__DETAIL__CPU_OVERLOADED,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL
>>
>>
>>
>> This has nothing to do with migration.
>> CPUOverloaded is a scheduling policy; unless there was any change in
>> that area, the obvious explanation would be that the host has a CPU overload
>> condition.
>> I briefly looked at the logs and see "cpuUser": "83.40", "cpuSys": "16.59",
>> "cpuIdle": "0.08", which indeed suggests an overload. From the same sample I
>> can see it's vdsm ("cpuUserVdsmd": "77.38", "cpuSysVdsmd": "18.44").
>>
>> Since similar values have been reported consistently for some time, and
>> there is a setupNetworks and storage rescan prior to the failure, and
>> there is no other indication of anything wrong, I'd just say the environment
>> or the order of tests or timing has changed, but there is nothing wrong with
>> the oVirt code.
>> Did any of that change recently? Does it reproduce locally?
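
The reasoning above (user plus system CPU leaves almost no idle time, and vdsm accounts for most of it) can be checked with a few lines of Python. This is only a rough sketch: the stat values are copied from the sample quoted above, while the 80% threshold, the function names, and the assumption that the Vdsmd figures are on the same scale as the host figures are illustrative guesses, not the engine's actual CPUOverloaded logic (which may also take into account how long the load persists).

```python
import json

# Values copied from the vdsm stats sample quoted in the mail above.
sample = json.loads(
    '{"cpuUser": "83.40", "cpuSys": "16.59", "cpuIdle": "0.08", '
    '"cpuUserVdsmd": "77.38", "cpuSysVdsmd": "18.44"}'
)

# Hypothetical threshold for illustration only; the real value is defined
# by the cluster's scheduling policy, not by this script.
HIGH_UTILIZATION_PCT = 80.0


def cpu_usage(stats):
    """Total host CPU usage in percent (user + system)."""
    return float(stats["cpuUser"]) + float(stats["cpuSys"])


def is_overloaded(stats, threshold=HIGH_UTILIZATION_PCT):
    """True if total CPU usage exceeds the (assumed) threshold."""
    return cpu_usage(stats) > threshold


def vdsm_share(stats):
    """Fraction of the total usage attributed to vdsmd.

    Assumes the Vdsmd counters use the same scale as the host counters.
    """
    vdsm = float(stats["cpuUserVdsmd"]) + float(stats["cpuSysVdsmd"])
    return vdsm / cpu_usage(stats)


if __name__ == "__main__":
    print("total usage: %.2f%%" % cpu_usage(sample))
    print("overloaded:", is_overloaded(sample))
    print("vdsm share of usage: %.1f%%" % (100 * vdsm_share(sample)))
```

Running the same check over several consecutive samples from vdsm.log would show whether the load is sustained or only a spike around the setupNetworks and storage rescan.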
>
>
> AFAIK, no significant changes were made to the environment or the tests.
> We will try to reproduce it locally and also on the manual job, but from
> what it looks like, it is very consistent (unlike other race failures we've
> seen lately) and continues to fail on the same tests, so it's either a
> change in oVirt or something else that we're not thinking of.
>
>>
>>
>> Thanks,
>> michal
>>
>> 2017-07-02 16:43:22,920-04 INFO
>> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27)
>> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock freed to object
>> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]',
>> sharedLocks=''}'
>> 2017-07-02 16:43:22,929-04 DEBUG
>> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>> (DefaultQuartzScheduler7) [] Rescheduling
>> DEFAULT.org.ovirt.engine.core.bll.ColdRebootAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775733
>> as there is no unfired trigger.
>> 2017-07-02 16:43:22,932-04 ERROR
>> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
>> task-27) [] Operation Failed: [Cannot migrate VM. There is no host that
>> satisfies current scheduling constraints. See below for details:, The host
>> lago-basic-suite-master-host0 did not satisfy internal filter CPUOverloaded
>> because its CPU is too loaded.]
>> 2017-07-02 16:43:23,331-04 DEBUG
>> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>> (DefaultQuartzScheduler2) [] Rescheduling
>> DEFAULT.org.ovirt.engine.core.bll.HaAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775793
>> as there is no unfired trigger.
>> 2017-07-02 16:43:23,332-04 DEBUG
>> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener]
>> (DefaultQuartzScheduler2) [] Rescheduling
>> DEFAULT.org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods#-9223372036854775783
>> as there is no unfired trigger.
>>
>> </engine log>
>>
>>
>>
>> Best Regards,
>>
>> Shlomi Ben-David | Software Engineer | Red Hat ISRAEL
>> RHCSA | RHCVA | RHCE
>> IRC: shlomibendavid (on #rhev-integ, #rhev-dev, #rhev-ci)
>>
>> OPEN SOURCE - 1 4 011 && 011 4 1
>>
>> _______________________________________________
>> Devel mailing list
>> [email protected]
>> http://lists.ovirt.org/mailman/listinfo/devel
>
>
> --
>
> Eyal Edri
>
> ASSOCIATE MANAGER
> RHV DevOps
> EMEA VIRTUALIZATION R&D
>
> Red Hat EMEA
> TRIED. TESTED. TRUSTED.
> phone: +972-9-7692018
> irc: eedri (on #tlv #rhev-dev #rhev-integ)
