On Tue, Dec 19, 2017 at 1:36 PM, Michal Skrivanek < [email protected]> wrote:
> > On 19 Dec 2017, at 10:14, Arik Hadas <[email protected]> wrote: > > > > On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek < > [email protected]> wrote: > >> >> > On 18 Dec 2017, at 13:21, Milan Zamazal <[email protected]> wrote: >> > >> > Yedidyah Bar David <[email protected]> writes: >> > >> >> On Mon, Dec 18, 2017 at 10:17 AM, Code Review <[email protected]> >> wrote: >> >>> Jenkins CI posted comments on this change. >> >>> >> >> >> >>> View Change >> >>> >> >>> Patch set 3:Continuous-Integration -1 >> >>> >> >>> Build Failed >> >>> >> >>> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check >> -patch-el7-x86_64/2882/ >> >>> : FAILURE >> >> >> >> Console output of above job says: >> >> >> >> 08:13:34 # migrate_vm: >> >> 08:16:37 * Collect artifacts: >> >> 08:16:40 * Collect artifacts: Success (in 0:00:03) >> >> 08:16:40 # migrate_vm: Success (in 0:03:06) >> >> 08:16:40 # Results located at >> >> /dev/shm/ost/deployment-basic-suite-master/default/006_migra >> tions.py.junit.xml >> >> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50) >> >> 08:16:40 Error occured, aborting >> >> >> >> The file 006_migrations.py.junit.xml [1] says: >> >> >> >> <failure type="exceptions.AssertionError" message="False != True after >> >> 180 seconds"> >> > >> > Reading the logs, I can see the VM migrates normally and seems to be >> > reported to Engine correctly. When Engine receives end-of-migration >> > event, it sends Destroy to the source (which is correct), calls dumpxmls >> > on the destination in the meantime (looks fine to me) and then calls >> >> looks like a race between getallvmstats reporting VM as Down (statusTime: >> 4296271980) being processed, while there is a Down/MigrationSucceeded event >> arriving (with notify_time 4296273170) at about the same time >> Unfortunately the vdsm.log is not in DEBUG level so there’s very little >> information as to why and what exactly did it send out. >> @infra - can you enable debug log level for vdsm by default? > > >> It does look like a race to me - does it reproduce? > > >> > Destroy on the destination, which is weird and I don't understand why >> > the Destroy is invoked. >> > >> > Arik, would you like to take a look? Maybe I overlooked something or >> > maybe there's a bug. The logs are at >> > http://jenkins.ovirt.org/job/ovirt-system-tests_master_check >> -patch-el7-x86_64/2882/artifact/exported-artifacts/basic- >> suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/ >> > and the interesting things happen around 2017-12-18 03:13:43,758-05. >> > > So it looks like that: > 1. the engine polls the VMs from the source host > 2. right after #1 we get the down event with proper exit reason (= > migration succeeded) but the engine doesn't process it since the VM is > being locked by the monitoring as part of processing that polling (to > prevent two analysis of the same VM simultaneously). > 3. the result of the polling is a VM in status Down and must probably > exit_status=Normal > 4. the engine decides to abort the migration and thus the monitoring > thread of the source host destroys the VM on the destination host. > > Unfortunately we don't have the exit_reason that is returned by the > polling. > However, the only option I can think of is that it is different than > MigrationSucceeded, because otherwise we would have hand-over the VM to the > destination host rather than aborting the migration [1]. > That part of the code recently changed as part of [2] - we used to > hand-over the VM when we get from the source host: > status = Down + exit_status = Normal > And in the database: previous_status = MigrationFrom > But after that change we require: > status = Down + exit_status = Normal ** + exit_reason = MigrationSucceeded > ** > And in the database: previous_status = MigrationFrom > > Long story short, is it possible that VDSM had set the status of the VM to > Down and exit_status to Normal but the exit_reason was not updated (yet?) > to MigrationSucceeded? > > > ok, so there might be a plausible explanation > the guest drive mapping introduced a significant delay into the > VM.getStats call since it tries to update the mapping when it detects a > change. That is likely to happen on lifecycle changes. In the OST case it > took 1.2s to finish the whole call, and in the meantime the migration has > finished. The getStats() call is not written with possible state change in > mind, so if it so happens and the state moves from anything to Down in the > middle of it it returns a Down state without exitCode and exitReason which > confuses engine. We started to use the exitReason code to differentiate the > various flavors of Down in engine in ~4.1 and in this case it results in > misleading “VM powered off by admin” case > > we need to fix the VM.getStats() to handle VM state changes in the middle > we need to fix the guest drive mapping updates to handle cleanly > situations when the VM is either not ready yet or already gone > Has anyone filed BZs for both? Can we ensure we do? The bottom line result sounds severe to me. Y. > Thanks, > michal > > > [1] https://github.com/oVirt/ovirt-engine/blob/master/ > backend/manager/modules/vdsbroker/src/main/java/org/ > ovirt/engine/core/vdsbroker/monitoring/VmAnalyzer.java#L291 > [2] https://gerrit.ovirt.org/#/c/84387/ > > >> > >> >> Can someone please have a look? Thanks. >> >> >> >> As a side note, if indeed this is the cause for the failure for the >> >> job, it's misleading to say "migrate_vm: Success". >> >> >> >> [1] >> >> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check >> -patch-el7-x86_64/2882/artifact/exported-artifacts/basic- >> suite-master__logs/006_migrations.py.junit.xml >> >> >> >>> >> >>> To view, visit change 85177. To unsubscribe, visit settings. >> >>> >> >>> Gerrit-Project: ovirt-system-tests >> >>> Gerrit-Branch: master >> >>> Gerrit-MessageType: comment >> >>> Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44be22dd590b5 >> >>> Gerrit-Change-Number: 85177 >> >>> Gerrit-PatchSet: 3 >> >>> Gerrit-Owner: Yedidyah Bar David <[email protected]> >> >>> Gerrit-Reviewer: Dafna Ron <[email protected]> >> >>> Gerrit-Reviewer: Eyal Edri <[email protected]> >> >>> Gerrit-Reviewer: Jenkins CI >> >>> Gerrit-Reviewer: Sandro Bonazzola <[email protected]> >> >>> Gerrit-Reviewer: Yedidyah Bar David <[email protected]> >> >>> Gerrit-Comment-Date: Mon, 18 Dec 2017 08:17:11 +0000 >> >>> Gerrit-HasComments: No >> > _______________________________________________ >> > Devel mailing list >> > [email protected] >> > http://lists.ovirt.org/mailman/listinfo/devel >> > >> > >> >> > > > _______________________________________________ > Devel mailing list > [email protected] > http://lists.ovirt.org/mailman/listinfo/devel >
_______________________________________________ Devel mailing list [email protected] http://lists.ovirt.org/mailman/listinfo/devel
