> On 4 Jul 2017, at 13:00, Eyal Edri <[email protected]> wrote: > > I was able to reproduce the error [1] on a manual run with only new vdsm from > [2], > and also to verify that w/o this change, while using latest tested run [3] it > works. > > So I think this proves quite clearly the problem is one of the latest VDSM > patches.
There is only a single patch between vdsms [1] and [3] https://gerrit.ovirt.org/#/c/78536 > > I'm running again the test with the suspected bad VDSM and hopefully will be > able to extract the env to tar.gz file > which anyone can import using the lago demo tool. > > > > [1] > http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/748/ > > <http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/748/> > [2] http://jenkins.ovirt.org/job/vdsm_master_build-artifacts-el7-x86_64/2694/ > <http://jenkins.ovirt.org/job/vdsm_master_build-artifacts-el7-x86_64/2694/> > [3] > http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/747/ > > <http://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/747/> > > > > On Tue, Jul 4, 2017 at 1:30 PM, Nadav Goldin <[email protected] > <mailto:[email protected]>> wrote: > Hi, sorry for posting late, I had a brief look at this yesterday: > 1. I couldn't replicate it locally - which means it is most likely a > recent change. > 2. I looked at the libvirt XMLs Lago generatd for the hosts, as a new > version is used this week(0.40) - and they seem OK - specifically > memroy and vcpus(which was my initial suspect). > 3. I saw two Engine patches, a bit prior to the time it started to > fail, which *might* in my common sense be related, but it is out of my > scope to tell(CC'ed patch owners): > > core: Make VmAnalyzer to treat a migrated Paused VM as success - > https://gerrit.ovirt.org/78305 <https://gerrit.ovirt.org/78305> > > fix custom fencing default config setting > https://gerrit.ovirt.org/78720 <https://gerrit.ovirt.org/78720> > > Shot in the wild - Could it be that the 'CPUOverload' filter was not > active before for some reason? > > Also, there are some exceptions in host0 vdsm log[1], failing to get > VM stats, though I can't tell if they are specific to this failure. > > Of course this is not a complete analysis, I hope it helps. > > > [1] > http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host0/_var_log/vdsm/vdsm.log > > <http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/lago-basic-suite-master-host0/_var_log/vdsm/vdsm.log> > > > Nadav. > > > > > > On Tue, Jul 4, 2017 at 12:46 PM, Eyal Edri <[email protected] > <mailto:[email protected]>> wrote: > > > > > > On Tue, Jul 4, 2017 at 12:18 PM, Michal Skrivanek > > <[email protected] <mailto:[email protected]>> wrote: > >> > >> > >> On 3 Jul 2017, at 15:35, Shlomo Ben David <[email protected] > >> <mailto:[email protected]>> wrote: > >> > >> Hi, > >> > >> Test failed: [ 006_migrations.migrate_vm ] > >> Link to suspected patches: N/A > >> Link to Job: > >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/ > >> <http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/> > >> Link to all logs: > >> Error snippet from the log: > >> http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/ > >> > >> <http://jenkins.ovirt.org/job/test-repo_ovirt_experimental_master/7431/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-006_migrations.py/> > >> > >> <error> > >> > >> "Fault reason is "Operation Failed". Fault detail is "[Cannot migrate VM. > >> There is no host that satisfies current scheduling constraints. See below > >> for details:, The host lago-basic-suite-master-host0 did not satisfy > >> internal filter CPUOverloaded because its CPU is too loaded.]" > >> > >> </error> > >> > >> <engine log> > >> > >> 2017-07-02 16:43:22,829-04 INFO > >> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27) > >> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock Acquired to object > >> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]', > >> sharedLocks=''}' > >> 2017-07-02 16:43:22,833-04 DEBUG > >> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] > >> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored > >> procedure. Call string is [{call getdiskvmelementspluggedtovm(?)}] > >> 2017-07-02 16:43:22,833-04 DEBUG > >> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] > >> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for > >> procedure [GetDiskVmElementsPluggedToVm] compiled > >> 2017-07-02 16:43:22,843-04 DEBUG > >> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] > >> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Compiled stored > >> procedure. Call string is [{call getattacheddisksnapshotstovm(?, ?)}] > >> 2017-07-02 16:43:22,843-04 DEBUG > >> [org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresSimpleJdbcCall] > >> (default task-27) [87508047-fdc5-4a2f-9692-c83f7b55bbc2] SqlCall for > >> procedure [GetAttachedDiskSnapshotsToVm] compiled > >> 2017-07-02 16:43:22,919-04 INFO > >> [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-27) > >> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Candidate host > >> 'lago-basic-suite-master-host0' ('46bdc63d-98f5-4eee-81aa-2fb88b8f7cbe') > >> was > >> filtered out by 'VAR__FILTERTYPE__INTERNAL' filter 'CPUOverloaded' > >> (correlation id: null) > >> 2017-07-02 16:43:22,920-04 WARN > >> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27) > >> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Validation of action > >> 'MigrateVmToServer' failed for user admin@internal-authz. Reasons: > >> VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName > >> lago-basic-suite-master-host0,$filterName > >> CPUOverloaded,VAR__DETAIL__CPU_OVERLOADED,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL > >> > >> > >> > >> This has nothing to do with migration > >> The CPUOverload is a scheduling policy, unless there was any change in > >> that area the obvious explanation would be that the host has a CPU overload > >> condition. > >> I briefly looked at logs and see ""cpuUser": "83.40", "cpuSys": "16.59", > >> "cpuIdle": “0.08”” which indeed suggests an overload, from the same sample > >> I > >> can see it’s vdsm ("cpuUserVdsmd": “77.38”, cpuSysVdsmd": “18.44" > >> > >> Since similar values are consistently being reported for some time, and > >> there is a setupNetworks and storage rescan prior to the the failure, and > >> there is no other indication of anything wrong, I’d just say the > >> environment > >> or the order of tests or timing has changed, but nothing wrong with the > >> oVirt code > >> Did any of that changed recently? Does it reproduce locally? > > > > > > AFAIK, no significant environment changes or tests were done. > > We will try to reproduce it locally and also on the manual job, but from > > what it looks it is very consistent (unlike other race failures we've seen > > lately ) and continues to fails on the same tests, so its either a change in > > oVirt or something else that we're not thinking on. > > > >> > >> > >> Thanks, > >> michal > >> > >> 2017-07-02 16:43:22,920-04 INFO > >> [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (default task-27) > >> [87508047-fdc5-4a2f-9692-c83f7b55bbc2] Lock freed to object > >> 'EngineLock:{exclusiveLocks='[2b34910d-cef2-44d6-a274-30e8473eb5d9=VM]', > >> sharedLocks=''}' > >> 2017-07-02 16:43:22,929-04 DEBUG > >> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener] > >> (DefaultQuartzScheduler7) [] Rescheduling > >> DEFAULT.org.ovirt.engine.core.bll.ColdRebootAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775733 > >> as there is no unfired trigger. > >> 2017-07-02 16:43:22,932-04 ERROR > >> [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default > >> task-27) [] Operation Failed: [Cannot migrate VM. There is no host that > >> satisfies current scheduling constraints. See below for details:, The host > >> lago-basic-suite-master-host0 did not satisfy internal filter CPUOverloaded > >> because its CPU is too loaded.] > >> 2017-07-02 16:43:23,331-04 DEBUG > >> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener] > >> (DefaultQuartzScheduler2) [] Rescheduling > >> DEFAULT.org.ovirt.engine.core.bll.HaAutoStartVmsRunner.startFailedAutoStartVms#-9223372036854775793 > >> as there is no unfired trigger. > >> 2017-07-02 16:43:23,332-04 DEBUG > >> [org.ovirt.engine.core.utils.timer.FixedDelayJobListener] > >> (DefaultQuartzScheduler2) [] Rescheduling > >> DEFAULT.org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods#-9223372036854775783 > >> as there is no unfired trigger. > >> > >> <engine log> > >> > >> > >> > >> Best Regards, > >> > >> Shlomi Ben-David | Software Engineer | Red Hat ISRAEL > >> RHCSA | RHCVA | RHCE > >> IRC: shlomibendavid (on #rhev-integ, #rhev-dev, #rhev-ci) > >> > >> OPEN SOURCE - 1 4 011 && 011 4 1 > >> > >> _______________________________________________ > >> Devel mailing list > >> [email protected] <mailto:[email protected]> > >> http://lists.ovirt.org/mailman/listinfo/devel > >> <http://lists.ovirt.org/mailman/listinfo/devel> > >> > >> > >> > >> _______________________________________________ > >> Devel mailing list > >> [email protected] <mailto:[email protected]> > >> http://lists.ovirt.org/mailman/listinfo/devel > >> <http://lists.ovirt.org/mailman/listinfo/devel> > > > > > > > > > > -- > > > > Eyal edri > > > > > > ASSOCIATE MANAGER > > > > RHV DevOps > > > > EMEA VIRTUALIZATION R&D > > > > > > Red Hat EMEA > > > > TRIED. TESTED. TRUSTED. > > phone: +972-9-7692018 <tel:%2B972-9-7692018> > > irc: eedri (on #tlv #rhev-dev #rhev-integ) > > > > _______________________________________________ > > Devel mailing list > > [email protected] <mailto:[email protected]> > > http://lists.ovirt.org/mailman/listinfo/devel > > <http://lists.ovirt.org/mailman/listinfo/devel> > > > > -- > EYAL EDRI > > ASSOCIATE MANAGER > RHV DEVOPS > EMEA VIRTUALIZATION R&D > > Red Hat EMEA <https://www.redhat.com/> > <https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted> > phone: +972-9-7692018 <tel:+972%209-769-2018> > irc: eedri (on #tlv #rhev-dev #rhev-integ)
_______________________________________________ Devel mailing list [email protected] http://lists.ovirt.org/mailman/listinfo/devel
