> On 19 Dec 2017, at 15:13, Yaniv Kaul <[email protected]> wrote: > > > > On Tue, Dec 19, 2017 at 1:36 PM, Michal Skrivanek > <[email protected]> wrote: > >> On 19 Dec 2017, at 10:14, Arik Hadas <[email protected]> wrote: >> >> >> >> On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek >> <[email protected]> wrote: >> >> > On 18 Dec 2017, at 13:21, Milan Zamazal <[email protected]> wrote: >> > >> > Yedidyah Bar David <[email protected]> writes: >> > >> >> On Mon, Dec 18, 2017 at 10:17 AM, Code Review <[email protected]> wrote: >> >>> Jenkins CI posted comments on this change. >> >>> >> >> >> >>> View Change >> >>> >> >>> Patch set 3:Continuous-Integration -1 >> >>> >> >>> Build Failed >> >>> >> >>> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/ >> >>> : FAILURE >> >> >> >> Console output of above job says: >> >> >> >> 08:13:34 # migrate_vm: >> >> 08:16:37 * Collect artifacts: >> >> 08:16:40 * Collect artifacts: Success (in 0:00:03) >> >> 08:16:40 # migrate_vm: Success (in 0:03:06) >> >> 08:16:40 # Results located at >> >> /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml >> >> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50) >> >> 08:16:40 Error occured, aborting >> >> >> >> The file 006_migrations.py.junit.xml [1] says: >> >> >> >> <failure type="exceptions.AssertionError" message="False != True after >> >> 180 seconds"> >> > >> > Reading the logs, I can see the VM migrates normally and seems to be >> > reported to Engine correctly. When Engine receives end-of-migration >> > event, it sends Destroy to the source (which is correct), calls dumpxmls >> > on the destination in the meantime (looks fine to me) and then calls >> >> looks like a race between getallvmstats reporting VM as Down (statusTime: >> 4296271980) being processed, while there is a Down/MigrationSucceeded event >> arriving (with notify_time 4296273170) at about the same time >> Unfortunately the vdsm.log is not in DEBUG level so there’s very little >> information as to why and what exactly did it send out. >> @infra - can you enable debug log level for vdsm by default? >> >> It does look like a race to me - does it reproduce? >> >> > Destroy on the destination, which is weird and I don't understand why >> > the Destroy is invoked. >> > >> > Arik, would you like to take a look? Maybe I overlooked something or >> > maybe there's a bug. The logs are at >> > http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/ >> > and the interesting things happen around 2017-12-18 03:13:43,758-05. >> >> So it looks like that: >> 1. the engine polls the VMs from the source host >> 2. right after #1 we get the down event with proper exit reason (= migration >> succeeded) but the engine doesn't process it since the VM is being locked by >> the monitoring as part of processing that polling (to prevent two analysis >> of the same VM simultaneously). >> 3. the result of the polling is a VM in status Down and must probably >> exit_status=Normal >> 4. the engine decides to abort the migration and thus the monitoring thread >> of the source host destroys the VM on the destination host. >> >> Unfortunately we don't have the exit_reason that is returned by the polling. >> However, the only option I can think of is that it is different than >> MigrationSucceeded, because otherwise we would have hand-over the VM to the >> destination host rather than aborting the migration [1]. >> That part of the code recently changed as part of [2] - we used to hand-over >> the VM when we get from the source host: >> status = Down + exit_status = Normal >> And in the database: previous_status = MigrationFrom >> But after that change we require: >> status = Down + exit_status = Normal ** + exit_reason = MigrationSucceeded ** >> And in the database: previous_status = MigrationFrom >> >> Long story short, is it possible that VDSM had set the status of the VM to >> Down and exit_status to Normal but the exit_reason was not updated (yet?) to >> MigrationSucceeded? > > ok, so there might be a plausible explanation > the guest drive mapping introduced a significant delay into the VM.getStats > call since it tries to update the mapping when it detects a change. That is > likely to happen on lifecycle changes. In the OST case it took 1.2s to finish > the whole call, and in the meantime the migration has finished. The > getStats() call is not written with possible state change in mind, so if it > so happens and the state moves from anything to Down in the middle of it it > returns a Down state without exitCode and exitReason which confuses engine. > We started to use the exitReason code to differentiate the various flavors of > Down in engine in ~4.1 and in this case it results in misleading “VM powered > off by admin” case > > we need to fix the VM.getStats() to handle VM state changes in the middle > we need to fix the guest drive mapping updates to handle cleanly situations > when the VM is either not ready yet or already gone > > Has anyone filed BZs for both? Can we ensure we do? The bottom line result > sounds severe to me.
yes, forgot to update here https://bugzilla.redhat.com/show_bug.cgi?id=1527416 it is indeed severe (your VM gets shut down once it migrates) and should be fixed in 4.2.1 > Y. > > > Thanks, > michal > >> >> [1] >> https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnalyzer.java#L291 >> [2] https://gerrit.ovirt.org/#/c/84387/ >> >> > >> >> Can someone please have a look? Thanks. >> >> >> >> As a side note, if indeed this is the cause for the failure for the >> >> job, it's misleading to say "migrate_vm: Success". >> >> >> >> [1] >> >> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migrations.py.junit.xml >> >> >> >>> >> >>> To view, visit change 85177. To unsubscribe, visit settings. >> >>> >> >>> Gerrit-Project: ovirt-system-tests >> >>> Gerrit-Branch: master >> >>> Gerrit-MessageType: comment >> >>> Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44be22dd590b5 >> >>> Gerrit-Change-Number: 85177 >> >>> Gerrit-PatchSet: 3 >> >>> Gerrit-Owner: Yedidyah Bar David <[email protected]> >> >>> Gerrit-Reviewer: Dafna Ron <[email protected]> >> >>> Gerrit-Reviewer: Eyal Edri <[email protected]> >> >>> Gerrit-Reviewer: Jenkins CI >> >>> Gerrit-Reviewer: Sandro Bonazzola <[email protected]> >> >>> Gerrit-Reviewer: Yedidyah Bar David <[email protected]> >> >>> Gerrit-Comment-Date: Mon, 18 Dec 2017 08:17:11 +0000 >> >>> Gerrit-HasComments: No >> > _______________________________________________ >> > Devel mailing list >> > [email protected] >> > http://lists.ovirt.org/mailman/listinfo/devel >> > >> > >> >> > > > _______________________________________________ > Devel mailing list > [email protected] > http://lists.ovirt.org/mailman/listinfo/devel > _______________________________________________ Devel mailing list [email protected] http://lists.ovirt.org/mailman/listinfo/devel
