> On 19 Dec 2017, at 15:13, Yaniv Kaul <[email protected]> wrote:
> 
> 
> 
> On Tue, Dec 19, 2017 at 1:36 PM, Michal Skrivanek 
> <[email protected]> wrote:
> 
>> On 19 Dec 2017, at 10:14, Arik Hadas <[email protected]> wrote:
>> 
>> 
>> 
>> On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek 
>> <[email protected]> wrote:
>> 
>> > On 18 Dec 2017, at 13:21, Milan Zamazal <[email protected]> wrote:
>> >
>> > Yedidyah Bar David <[email protected]> writes:
>> >
>> >> On Mon, Dec 18, 2017 at 10:17 AM, Code Review <[email protected]> wrote:
>> >>> Jenkins CI posted comments on this change.
>> >>>
>> >>
>> >>> View Change
>> >>>
>> >>> Patch set 3:Continuous-Integration -1
>> >>>
>> >>> Build Failed
>> >>>
>> >>> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/
>> >>> : FAILURE
>> >>
>> >> Console output of above job says:
>> >>
>> >> 08:13:34   # migrate_vm:
>> >> 08:16:37     * Collect artifacts:
>> >> 08:16:40     * Collect artifacts: Success (in 0:00:03)
>> >> 08:16:40   # migrate_vm: Success (in 0:03:06)
>> >> 08:16:40   # Results located at
>> >> /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml
>> >> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50)
>> >> 08:16:40 Error occured, aborting
>> >>
>> >> The file 006_migrations.py.junit.xml [1] says:
>> >>
>> >> <failure type="exceptions.AssertionError" message="False != True after
>> >> 180 seconds">
>> >
>> > Reading the logs, I can see the VM migrates normally and seems to be
>> > reported to Engine correctly.  When Engine receives end-of-migration
>> > event, it sends Destroy to the source (which is correct), calls dumpxmls
>> > on the destination in the meantime (looks fine to me) and then calls
>> 
>> looks like a race between getallvmstats reporting VM as Down (statusTime: 
>> 4296271980) being processed, while there is a Down/MigrationSucceeded event 
>> arriving (with notify_time 4296273170) at about the same time
>> Unfortunately the vdsm.log is not in DEBUG level so there’s very little 
>> information as to why and what exactly did it send out.
>> @infra - can you enable debug log level for vdsm by default? 
>> 
>> It does look like a race to me - does it reproduce? 
>> 
>> > Destroy on the destination, which is weird and I don't understand why
>> > the Destroy is invoked.
>> >
>> > Arik, would you like to take a look?  Maybe I overlooked something or
>> > maybe there's a bug.  The logs are at
>> > http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/
>> > and the interesting things happen around 2017-12-18 03:13:43,758-05.
>> 
>> So it looks like that:
>> 1. the engine polls the VMs from the source host
>> 2. right after #1 we get the down event with proper exit reason (= migration 
>> succeeded) but the engine doesn't process it since the VM is being locked by 
>> the monitoring as part of processing that polling (to prevent two analysis 
>> of the same VM simultaneously).
>> 3. the result of the polling is a VM in status Down and must probably 
>> exit_status=Normal
>> 4. the engine decides to abort the migration and thus the monitoring thread 
>> of the source host destroys the VM on the destination host.
>> 
>> Unfortunately we don't have the exit_reason that is returned by the polling.
>> However, the only option I can think of is that it is different than 
>> MigrationSucceeded, because otherwise we would have hand-over the VM to the 
>> destination host rather than aborting the migration [1].
>> That part of the code recently changed as part of [2] - we used to hand-over 
>> the VM when we get from the source host:
>> status = Down + exit_status = Normal 
>> And in the database: previous_status = MigrationFrom
>> But after that change we require:
>> status = Down + exit_status = Normal ** + exit_reason = MigrationSucceeded **
>> And in the database: previous_status = MigrationFrom
>> 
>> Long story short, is it possible that VDSM had set the status of the VM to 
>> Down and exit_status to Normal but the exit_reason was not updated (yet?) to 
>> MigrationSucceeded?
> 
> ok, so there might be a plausible explanation
> the guest drive mapping introduced a significant delay into the VM.getStats 
> call since it tries to update the mapping when it detects a change. That is 
> likely to happen on lifecycle changes. In the OST case it took 1.2s to finish 
> the whole call, and in the meantime the migration has finished. The 
> getStats() call is not written with possible state change in mind, so if it 
> so happens and the state moves from anything to Down in the middle of it it 
> returns a Down state without exitCode and exitReason which confuses engine. 
> We started to use the exitReason code to differentiate the various flavors of 
> Down in engine in ~4.1 and in this case it results in misleading “VM powered 
> off by admin” case
> 
> we need to fix the VM.getStats() to handle VM state changes in the middle
> we need to fix the guest drive mapping updates to handle cleanly situations 
> when the VM is either not ready yet or already gone
> 
> Has anyone filed BZs for both? Can we ensure we do? The bottom line result 
> sounds severe to me. 

yes, forgot to update here
https://bugzilla.redhat.com/show_bug.cgi?id=1527416

it is indeed severe (your VM gets shut down once it migrates) and should be 
fixed in 4.2.1

> Y.
> 
> 
> Thanks,
> michal
> 
>> 
>> [1] 
>> https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnalyzer.java#L291
>> [2] https://gerrit.ovirt.org/#/c/84387/
>>  
>> >
>> >> Can someone please have a look? Thanks.
>> >>
>> >> As a side note, if indeed this is the cause for the failure for the
>> >> job, it's misleading to say "migrate_vm: Success".
>> >>
>> >> [1]
>> >> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migrations.py.junit.xml
>> >>
>> >>>
>> >>> To view, visit change 85177. To unsubscribe, visit settings.
>> >>>
>> >>> Gerrit-Project: ovirt-system-tests
>> >>> Gerrit-Branch: master
>> >>> Gerrit-MessageType: comment
>> >>> Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44be22dd590b5
>> >>> Gerrit-Change-Number: 85177
>> >>> Gerrit-PatchSet: 3
>> >>> Gerrit-Owner: Yedidyah Bar David <[email protected]>
>> >>> Gerrit-Reviewer: Dafna Ron <[email protected]>
>> >>> Gerrit-Reviewer: Eyal Edri <[email protected]>
>> >>> Gerrit-Reviewer: Jenkins CI
>> >>> Gerrit-Reviewer: Sandro Bonazzola <[email protected]>
>> >>> Gerrit-Reviewer: Yedidyah Bar David <[email protected]>
>> >>> Gerrit-Comment-Date: Mon, 18 Dec 2017 08:17:11 +0000
>> >>> Gerrit-HasComments: No
>> > _______________________________________________
>> > Devel mailing list
>> > [email protected]
>> > http://lists.ovirt.org/mailman/listinfo/devel
>> >
>> >
>> 
>> 
> 
> 
> _______________________________________________
> Devel mailing list
> [email protected]
> http://lists.ovirt.org/mailman/listinfo/devel
> 

_______________________________________________
Devel mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/devel

Reply via email to