Re: [ovirt-devel] oVirt 4.2.0 GA Go / No go

2017-12-19 Thread Martin Sivak
I have my reservations, but perfect is the enemy of good.

+1

Martin

On Tue, Dec 19, 2017 at 8:11 PM, Sandro Bonazzola 
wrote:

>
>
> 2017-12-19 15:21 GMT+01:00 Sandro Bonazzola :
>
>> Hi,
>> we released 4.2.0 RC3 yesterday and currently we don't have approved or
>> proposed blockers.
>> So maintainers, please give your Go / No go for releasing GA tomorrow,
>> December 20th.
>>
>>
> + 1 for integration team
> + 1 for oVirt Node team (forwarding Ryan's vote)
>
>
>
>> Thanks,
>>

Re: [ovirt-devel] oVirt 4.2.0 GA Go / No go

2017-12-19 Thread Sandro Bonazzola
2017-12-19 15:21 GMT+01:00 Sandro Bonazzola :

> Hi,
> we released 4.2.0 RC3 yesterday and currently we don't have approved or
> proposed blockers.
> So maintainers, please give your Go / No go for releasing GA tomorrow,
> December 20th.
>
>
+ 1 for integration team
+ 1 for oVirt Node team (forwarding Ryan's vote)



> Thanks,
>


-- 

SANDRO BONAZZOLA

ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D

Red Hat EMEA 

TRIED. TESTED. TRUSTED. 

Re: [ovirt-devel] oVirt 4.2.0 GA Go / No go

2017-12-19 Thread Martin Perina
+1 from me

On Tue, Dec 19, 2017 at 6:03 PM, Michal Skrivanek <
michal.skriva...@redhat.com> wrote:

> fine with me too
>
> On 19 Dec 2017, at 15:45, Oved Ourfali  wrote:
>
> +1 on my end.
>
> On Tue, Dec 19, 2017 at 4:21 PM, Sandro Bonazzola 
> wrote:
>
>> Hi,
>> we released 4.2.0 RC3 yesterday and currently we don't have approved or
>> proposed blockers.
>> So maintainers, please give your Go / No go for releasing GA tomorrow,
>> December 20th.
>>
>> Thanks,
>>
>



-- 
Martin Perina
Associate Manager, Software Engineering
Red Hat Czech s.r.o.

Re: [ovirt-devel] oVirt 4.2.0 GA Go / No go

2017-12-19 Thread Michal Skrivanek
fine with me too

> On 19 Dec 2017, at 15:45, Oved Ourfali  wrote:
> 
> +1 on my end.
> 
> On Tue, Dec 19, 2017 at 4:21 PM, Sandro Bonazzola wrote:
> Hi,
> we released 4.2.0 RC3 yesterday and currently we don't have approved or 
> proposed blockers.
> So maintainers, please give your Go / No go for releasing GA tomorrow, 
> December 20th.
> 
> Thanks,
> 


Re: [ovirt-devel] oVirt 4.2.0 GA Go / No go

2017-12-19 Thread Oved Ourfali
+1 on my end.

On Tue, Dec 19, 2017 at 4:21 PM, Sandro Bonazzola 
wrote:

> Hi,
> we released 4.2.0 RC3 yesterday and currently we don't have approved or
> proposed blockers.
> So maintainers, please give your Go / No go for releasing GA tomorrow,
> December 20th.
>
> Thanks,
>

[ovirt-devel] oVirt 4.2.0 GA Go / No go

2017-12-19 Thread Sandro Bonazzola
Hi,
we released 4.2.0 RC3 yesterday and currently we don't have approved or
proposed blockers.
So maintainers, please give your Go / No go for releasing GA tomorrow,
December 20th.

Thanks,

-- 

SANDRO BONAZZOLA

ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D

Red Hat EMEA 

TRIED. TESTED. TRUSTED. 

Re: [ovirt-devel] Migration failed

2017-12-19 Thread Michal Skrivanek

> On 19 Dec 2017, at 15:13, Yaniv Kaul  wrote:
> 
> 
> 
> On Tue, Dec 19, 2017 at 1:36 PM, Michal Skrivanek 
>  wrote:
> 
>> On 19 Dec 2017, at 10:14, Arik Hadas  wrote:
>> 
>> 
>> 
>> On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek 
>>  wrote:
>> 
>> > On 18 Dec 2017, at 13:21, Milan Zamazal  wrote:
>> >
>> > Yedidyah Bar David  writes:
>> >
>> >> On Mon, Dec 18, 2017 at 10:17 AM, Code Review  wrote:
>> >>> Jenkins CI posted comments on this change.
>> >>>
>> >>
>> >>> View Change
>> >>>
>> >>> Patch set 3:Continuous-Integration -1
>> >>>
>> >>> Build Failed
>> >>>
>> >>> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/
>> >>> : FAILURE
>> >>
>> >> Console output of above job says:
>> >>
>> >> 08:13:34   # migrate_vm:
>> >> 08:16:37 * Collect artifacts:
>> >> 08:16:40 * Collect artifacts: Success (in 0:00:03)
>> >> 08:16:40   # migrate_vm: Success (in 0:03:06)
>> >> 08:16:40   # Results located at
>> >> /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml
>> >> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50)
>> >> 08:16:40 Error occured, aborting
>> >>
>> >> The file 006_migrations.py.junit.xml [1] says:
>> >>
>> >> 
>> >
>> > Reading the logs, I can see the VM migrates normally and seems to be
>> > reported to Engine correctly.  When Engine receives end-of-migration
>> > event, it sends Destroy to the source (which is correct), calls dumpxmls
>> > on the destination in the meantime (looks fine to me) and then calls
>> 
>> looks like a race between getallvmstats reporting the VM as Down (statusTime:
>> 4296271980) being processed, while a Down/MigrationSucceeded event is
>> arriving (with notify_time 4296273170) at about the same time.
>> Unfortunately the vdsm.log is not at DEBUG level, so there's very little
>> information as to why and what exactly it sent out.
>> @infra - can you enable debug log level for vdsm by default?
>> 
>> It does look like a race to me - does it reproduce? 
>> 
>> > Destroy on the destination, which is weird and I don't understand why
>> > the Destroy is invoked.
>> >
>> > Arik, would you like to take a look?  Maybe I overlooked something or
>> > maybe there's a bug.  The logs are at
>> > http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/
>> > and the interesting things happen around 2017-12-18 03:13:43,758-05.
>> 
>> So it looks like that:
>> 1. the engine polls the VMs from the source host
>> 2. right after #1 we get the down event with the proper exit reason (= migration
>> succeeded) but the engine doesn't process it since the VM is being locked by
>> the monitoring as part of processing that polling (to prevent two analyses
>> of the same VM running simultaneously).
>> 3. the result of the polling is a VM in status Down and most probably
>> exit_status=Normal
>> 4. the engine decides to abort the migration and thus the monitoring thread 
>> of the source host destroys the VM on the destination host.
>> 
>> Unfortunately we don't have the exit_reason that is returned by the polling.
>> However, the only option I can think of is that it is different from
>> MigrationSucceeded, because otherwise we would have handed the VM over to the
>> destination host rather than aborting the migration [1].
>> That part of the code recently changed as part of [2] - we used to hand over
>> the VM when we got from the source host:
>> status = Down + exit_status = Normal 
>> And in the database: previous_status = MigrationFrom
>> But after that change we require:
>> status = Down + exit_status = Normal ** + exit_reason = MigrationSucceeded **
>> And in the database: previous_status = MigrationFrom
>> 
>> Long story short, is it possible that VDSM had set the status of the VM to 
>> Down and exit_status to Normal but the exit_reason was not updated (yet?) to 
>> MigrationSucceeded?
> 
> ok, so there might be a plausible explanation
> the guest drive mapping introduced a significant delay into the VM.getStats
> call since it tries to update the mapping when it detects a change. That is
> likely to happen on lifecycle changes. In the OST case it took 1.2s to finish
> the whole call, and in the meantime the migration had finished. The
> getStats() call is not written with possible state changes in mind, so if
> the state moves from anything to Down in the middle of the call, it returns
> a Down state without exitCode and exitReason, which confuses the engine.
> We started to use the exitReason code to differentiate the various flavors of
> Down in the engine in ~4.1, and in this case it results in a misleading “VM
> powered off by admin” case
> 
> we need to fix the VM.getStats() to handle VM state changes in the 

Re: [ovirt-devel] oVirt 4.2.0 GA status

2017-12-19 Thread Sandro Bonazzola
2017-12-19 15:15 GMT+01:00 Oved Ourfali :

> Based on the above, removing the blocker flag from the bug.
>

Thanks. So the current status is: RC3 has no known blockers. I'll send a
go/no-go email for releasing GA tomorrow.




>
> On Tue, Dec 19, 2017 at 3:03 PM, Martin Sivak  wrote:
>
>> There is no misuse when an optional argument is not used. But hosted engine
>> uses the timeout parameter anyway, so you can assume it is not a blocker for
>> us.
>>
>> Martin
>>
>> On Tue, Dec 19, 2017 at 12:56 PM, Martin Perina 
>> wrote:
>>
>>> As Irit mentioned, the provided reproduction steps are wrong (misuse of
>>> the code) and she posted a correct example showing that the jsonrpc code
>>> works as expected. So Martin/Simone, are you using the original example
>>> that misuses the client anywhere in the HE code?
>>>
>>> Thanks
>>>
>>> Martin
>>>
>>>
>>> On Tue, Dec 19, 2017 at 12:53 PM, Oved Ourfali 
>>> wrote:
>>>
 From the latest comment it doesn't seem like a blocker to me.
 Martin S. - your thoughts?

 On Tue, Dec 19, 2017 at 1:48 PM, Sandro Bonazzola 
 wrote:

> We have a proposed blocker for the release:
> Bug 1527155 (Infra / vdsm / Bindings-API, igoih...@redhat.com, NEW, urgent,
> ovirt-4.2.0): jsonrpc reconnect logic does not work and gets stuck
>
> Please review and either approve the blocker or postpone to 4.2.1.
> Thanks,


-- 

SANDRO BONAZZOLA

ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D

Red Hat EMEA 

TRIED. TESTED. TRUSTED. 

Re: [ovirt-devel] Migration failed

2017-12-19 Thread Yaniv Kaul
On Tue, Dec 19, 2017 at 1:36 PM, Michal Skrivanek <
michal.skriva...@redhat.com> wrote:

>
> On 19 Dec 2017, at 10:14, Arik Hadas  wrote:
>
>
>
> On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek <
> michal.skriva...@redhat.com> wrote:
>
>>
>> > On 18 Dec 2017, at 13:21, Milan Zamazal  wrote:
>> >
>> > Yedidyah Bar David  writes:
>> >
>> >> On Mon, Dec 18, 2017 at 10:17 AM, Code Review 
>> wrote:
>> >>> Jenkins CI posted comments on this change.
>> >>>
>> >>
>> >>> View Change
>> >>>
>> >>> Patch set 3:Continuous-Integration -1
>> >>>
>> >>> Build Failed
>> >>>
>> >>> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/
>> >>> : FAILURE
>> >>
>> >> Console output of above job says:
>> >>
>> >> 08:13:34   # migrate_vm:
>> >> 08:16:37 * Collect artifacts:
>> >> 08:16:40 * Collect artifacts: Success (in 0:00:03)
>> >> 08:16:40   # migrate_vm: Success (in 0:03:06)
>> >> 08:16:40   # Results located at
>> >> /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml
>> >> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50)
>> >> 08:16:40 Error occured, aborting
>> >>
>> >> The file 006_migrations.py.junit.xml [1] says:
>> >>
>> >> 
>> >
>> > Reading the logs, I can see the VM migrates normally and seems to be
>> > reported to Engine correctly.  When Engine receives end-of-migration
>> > event, it sends Destroy to the source (which is correct), calls dumpxmls
>> > on the destination in the meantime (looks fine to me) and then calls
>>
>> looks like a race between getallvmstats reporting the VM as Down (statusTime:
>> 4296271980) being processed, while a Down/MigrationSucceeded event is
>> arriving (with notify_time 4296273170) at about the same time.
>> Unfortunately the vdsm.log is not at DEBUG level, so there's very little
>> information as to why and what exactly it sent out.
>> @infra - can you enable debug log level for vdsm by default?
>
>
>> It does look like a race to me - does it reproduce?
>
>
>> > Destroy on the destination, which is weird and I don't understand why
>> > the Destroy is invoked.
>> >
>> > Arik, would you like to take a look?  Maybe I overlooked something or
>> > maybe there's a bug.  The logs are at
>> > http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/
>> > and the interesting things happen around 2017-12-18 03:13:43,758-05.
>>
>
> So it looks like that:
> 1. the engine polls the VMs from the source host
> 2. right after #1 we get the down event with the proper exit reason (=
> migration succeeded) but the engine doesn't process it since the VM is
> being locked by the monitoring as part of processing that polling (to
> prevent two analyses of the same VM running simultaneously).
> 3. the result of the polling is a VM in status Down and most probably
> exit_status=Normal
> 4. the engine decides to abort the migration and thus the monitoring
> thread of the source host destroys the VM on the destination host.
>
> Unfortunately we don't have the exit_reason that is returned by the
> polling.
> However, the only option I can think of is that it is different from
> MigrationSucceeded, because otherwise we would have handed the VM over to the
> destination host rather than aborting the migration [1].
> That part of the code recently changed as part of [2] - we used to
> hand over the VM when we got from the source host:
> status = Down + exit_status = Normal
> And in the database: previous_status = MigrationFrom
> But after that change we require:
> status = Down + exit_status = Normal ** + exit_reason = MigrationSucceeded
> **
> And in the database: previous_status = MigrationFrom
>
> Long story short, is it possible that VDSM had set the status of the VM to
> Down and exit_status to Normal but the exit_reason was not updated (yet?)
> to MigrationSucceeded?
>
>
> ok, so there might be a plausible explanation
> the guest drive mapping introduced a significant delay into the
> VM.getStats call since it tries to update the mapping when it detects a
> change. That is likely to happen on lifecycle changes. In the OST case it
> took 1.2s to finish the whole call, and in the meantime the migration had
> finished. The getStats() call is not written with possible state changes in
> mind, so if the state moves from anything to Down in the middle of the call,
> it returns a Down state without exitCode and exitReason, which confuses the
> engine. We started to use the exitReason code to differentiate the
> various flavors of Down in the engine in ~4.1, and in this case it results in
> a misleading “VM powered off by admin” case
>
> we need to fix the VM.getStats() to handle VM state changes in the middle
> we need to fix the guest drive mapping updates to handle cleanly
> situations when the VM is either not ready yet 

Re: [ovirt-devel] oVirt 4.2.0 GA status

2017-12-19 Thread Martin Perina
On Tue, Dec 19, 2017 at 2:05 PM, Martin Sivak  wrote:

> >> So I think that we still have to fix it somehow.
> >> Are we really sure that nr_retries=2 and _timeout=20 are really the
> magic numbers that work under all conditions?
> >
> >
> > No, it should be tested on HE environment and it depends on your usage.
>
> What happens when only the timeout is specified and the connection
> fails after the command is sent? What are the defaults in that case?
>

So, there are no magic numbers that fit all cases; here's a description
of those parameters:

nr_retries
  - number of reconnection retries
  - if not specified, the default is 1

_timeout
  - the maximum time to wait for the reply to a command/verb if the client is
connected
  - it does not affect reconnection in any way, meaning the client could
reconnect for example for 10 minutes (using a high enough nr_retries value)
and this timeout may still not be reached

So here are 2 suggestions:

1. Set nr_retries=0 and the client will behave the same way as in 4.1 (meaning
no reconnection is performed)

2. Set nr_retries to a high enough number (for example 100 000) and hope that
this number of retries is enough for the host being deployed using host deploy.
I know that setting this number is tricky for HE, because host deploy can take
a widely varying amount of time, and there's no exact way to define the
timeout of a single reconnection attempt, as it depends on the failure
encountered while connecting.
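To make those semantics concrete, here is a minimal sketch of how nr_retries
and _timeout interact. The names and the structure are hypothetical and purely
illustrative - this is not the actual vdsm client API:

    # Hypothetical sketch of the semantics described above -- not vdsm code.
    # nr_retries bounds how many times the *connection* is re-attempted;
    # _timeout bounds how long we wait for the *reply* once connected.
    import socket
    import time

    def call_with_reconnect(connect, send_and_wait, request,
                            nr_retries=1, timeout=5.0):
        last_err = None
        for _ in range(1 + nr_retries):        # initial attempt + retries
            try:
                conn = connect()               # may fail during host deploy
            except socket.error as err:
                last_err = err
                time.sleep(1)                  # back off before the next try
                continue
            # The timeout applies only here, to the reply of a single verb;
            # it neither extends nor shortens the reconnection loop above.
            return send_and_wait(conn, request, timeout=timeout)
        raise last_err

Under this model, suggestion 1 (nr_retries=0) restores the single connection
attempt of 4.1, while suggestion 2 simply makes the loop long enough to
outlast host deploy.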

AFAIK no other functionality was mentioned in [1], so nothing else is
implemented. If there are other requirements for the reconnection
functionality, then let's open a new RFE for that and discuss it.

Thanks

Martin

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1376843


> Martin
>
> On Tue, Dec 19, 2017 at 1:55 PM, Irit Goihman  wrote:
> >
> >
> >
> > On Tue, Dec 19, 2017 at 2:51 PM, Simone Tiraboschi 
> wrote:
> >>
> >>
> >>
> >> On Tue, Dec 19, 2017 at 12:56 PM, Martin Perina 
> wrote:
> >>>
> >>> As Irit mentioned, the provided reproduction steps are wrong (misuse of
> the code) and she posted a correct example showing that the jsonrpc code works
> as expected. So Martin/Simone, are you using the original example that misuses
> the client anywhere in the HE code?
> >>
> >>
> >> According to
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1527155#c9
> >> It works in Irit's example, at least on that host with that load and
> timings, setting nr_retries=2 and _timeout=20
> >>
> >> While we have _timeout=5 and no custom nr_retries
> >> https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L417
> >>
> >> So I think that we still have to fix it somehow.
> >> Are we really sure that nr_retries=2 and _timeout=20 are really the
> magic numbers that work under all conditions?
> >
> >
> > No, it should be tested on HE environment and it depends on your usage.
> >>
> >>
> >>>
> >>>
> >>> Thanks
> >>>
> >>> Martin
> >>>
> >>>
> >>> On Tue, Dec 19, 2017 at 12:53 PM, Oved Ourfali 
> wrote:
> 
>  From the latest comment it doesn't seem like a blocker to me.
>  Martin S. - your thoughts?
> 
>  On Tue, Dec 19, 2017 at 1:48 PM, Sandro Bonazzola <
> sbona...@redhat.com> wrote:
> >
> > We have a proposed blocker for the release:
> > Bug 1527155 (Infra / vdsm / Bindings-API, igoihman@redhat.com, NEW, urgent,
> ovirt-4.2.0): jsonrpc reconnect logic does not work and gets stuck
> >
> > Please review and either approve the blocker or postpone to 4.2.1.
> > Thanks,
> >
> 
> 
> >>>
> >>>
> >>>
> >>
> >
> >
> >
>



-- 
Martin Perina
Associate Manager, Software Engineering
Red Hat Czech s.r.o.

Re: [ovirt-devel] oVirt 4.2.0 GA status

2017-12-19 Thread Martin Sivak
> The only control that we expose for reconnect is the number of retries.

Right, but what is the default value? Because I see a reconnect attempt
even with no nr_retries and no timeout provided, and then it gets stuck
(or waits for a really long timeout).

Martin

On Tue, Dec 19, 2017 at 2:28 PM, Piotr Kliczewski
 wrote:
> On Tue, Dec 19, 2017 at 2:05 PM, Martin Sivak  wrote:
 So I think that we still have to fix it somehow.
 Are we really sure that nr_retries=2 and _timeout=20 are really the magic
 numbers that work under all conditions?
>>>
>>>
>>> No, it should be tested on HE environment and it depends on your usage.
>>
>> What happens when only the timeout is specified and the connection
>> fails after the command is sent? What are the defaults in that case?
>
> Timeout defines how long we are going to wait for a response. It has been
> in the code for a while already and is not related to reconnect.
>
> The only control that we expose for reconnect is the number of retries.
>
>>
>> Martin
>>
>> On Tue, Dec 19, 2017 at 1:55 PM, Irit Goihman  wrote:
>>>
>>>
>>>
>>> On Tue, Dec 19, 2017 at 2:51 PM, Simone Tiraboschi  
>>> wrote:



 On Tue, Dec 19, 2017 at 12:56 PM, Martin Perina  wrote:
>
> As Irit mentioned, the provided reproduction steps are wrong (misuse of
> the code) and she posted a correct example showing that the jsonrpc code works
> as expected. So Martin/Simone, are you using the original example that misuses
> the client anywhere in the HE code?


 According to
 https://bugzilla.redhat.com/show_bug.cgi?id=1527155#c9
 It works in Irit's example, at least on that host with that load and
 timings, setting nr_retries=2 and _timeout=20

 While we have _timeout=5 and no custom nr_retries
 https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L417

 So I think that we still have to fix it somehow.
 Are we really sure that nr_retries=2 and _timeout=20 are really the magic
 numbers that work under all conditions?
>>>
>>>
>>> No, it should be tested on HE environment and it depends on your usage.


>
>
> Thanks
>
> Martin
>
>
> On Tue, Dec 19, 2017 at 12:53 PM, Oved Ourfali  
> wrote:
>>
>> From the latest comment it doesn't seem like a blocker to me.
>> Martin S. - your thoughts?
>>
>> On Tue, Dec 19, 2017 at 1:48 PM, Sandro Bonazzola  
>> wrote:
>>>
>>> We have a proposed blocker for the release:
>>> Bug 1527155 (Infra / vdsm / Bindings-API, igoihman@redhat.com, NEW, urgent,
>>> ovirt-4.2.0): jsonrpc reconnect logic does not work and gets stuck
>>>
>>> Please review and either approve the blocker or postpone to 4.2.1.
>>> Thanks,
>>>
>>
>>
>
>
>


>>>
>>>
>>>


Re: [ovirt-devel] oVirt 4.2.0 GA status

2017-12-19 Thread Piotr Kliczewski
On Tue, Dec 19, 2017 at 2:05 PM, Martin Sivak  wrote:
>>> So I think that we still have to fix it somehow.
>>> Are we really sure that nr_retries=2 and _timeout=20 are really the magic
>>> numbers that work under all conditions?
>>
>>
>> No, it should be tested on HE environment and it depends on your usage.
>
> What happens when only the timeout is specified and the connection
> fails after the command is sent? What are the defaults in that case?

Timeout defines how long we are going to wait for a response. It has been
in the code for a while already and is not related to reconnect.

The only control that we expose for reconnect is the number of retries.

>
> Martin
>
> On Tue, Dec 19, 2017 at 1:55 PM, Irit Goihman  wrote:
>>
>>
>>
>> On Tue, Dec 19, 2017 at 2:51 PM, Simone Tiraboschi  
>> wrote:
>>>
>>>
>>>
>>> On Tue, Dec 19, 2017 at 12:56 PM, Martin Perina  wrote:

 As Irit mentioned, the provided reproduction steps are wrong (misuse of the
 code) and she posted a correct example showing that the jsonrpc code works
 as expected. So Martin/Simone, are you using the original example that
 misuses the client anywhere in the HE code?
>>>
>>>
>>> According to
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1527155#c9
>>> It works in Irit's example, at least on that host with that load and timings,
>>> setting nr_retries=2 and _timeout=20
>>>
>>> While we have _timeout=5 and no custom nr_retries
>>> https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L417
>>>
>>> So I think that we still have to fix it somehow.
>>> Are we really sure that nr_retries=2 and _timeout=20 are really the magic
>>> numbers that work under all conditions?
>>
>>
>> No, it should be tested on HE environment and it depends on your usage.
>>>
>>>


 Thanks

 Martin


 On Tue, Dec 19, 2017 at 12:53 PM, Oved Ourfali  wrote:
>
> From the latest comment it doesn't seem like a blocker to me.
> Martin S. - your thoughts?
>
> On Tue, Dec 19, 2017 at 1:48 PM, Sandro Bonazzola  
> wrote:
>>
>> We have a proposed blocker for the release:
>> Bug 1527155 (Infra / vdsm / Bindings-API, igoihman@redhat.com, NEW, urgent,
>> ovirt-4.2.0): jsonrpc reconnect logic does not work and gets stuck
>>
>> Please review and either approve the blocker or postpone to 4.2.1.
>> Thanks,
>>
>
>



>>>
>>>
>>
>>
>>


Re: [ovirt-devel] oVirt 4.2.0 GA status

2017-12-19 Thread Martin Sivak
There is no misuse when an optional argument is not used. But hosted engine
uses the timeout parameter anyway, so you can assume it is not a blocker for
us.

Martin

On Tue, Dec 19, 2017 at 12:56 PM, Martin Perina  wrote:

> As Irit mentioned, the provided reproduction steps are wrong (misuse of the
> code) and she posted a correct example showing that the jsonrpc code works
> as expected. So Martin/Simone, are you using the original example that
> misuses the client anywhere in the HE code?
>
> Thanks
>
> Martin
>
>
> On Tue, Dec 19, 2017 at 12:53 PM, Oved Ourfali 
> wrote:
>
>> From the latest comment it doesn't seem like a blocker to me.
>> Martin S. - your thoughts?
>>
>> On Tue, Dec 19, 2017 at 1:48 PM, Sandro Bonazzola 
>> wrote:
>>
>>> We have a proposed blocker for the release:
>>> Bug 1527155 (Infra / vdsm / Bindings-API, igoih...@redhat.com, NEW, urgent,
>>> ovirt-4.2.0): jsonrpc reconnect logic does not work and gets stuck
>>>
>>> Please review and either approve the blocker or postpone to 4.2.1.
>>> Thanks,
>>>
>>
>>
>
>
>

Re: [ovirt-devel] oVirt 4.2.0 GA status

2017-12-19 Thread Martin Sivak
>> So I think that we still have to fix it somehow.
>> Are we really sure that nr_retries=2 and _timeout=20 are really the magic
>> numbers that work under all conditions?
>
>
> No, it should be tested on HE environment and it depends on your usage.

What happens when only the timeout is specified and the connection
fails after the command is sent? What are the defaults in that case?

Martin

On Tue, Dec 19, 2017 at 1:55 PM, Irit Goihman  wrote:
>
>
>
> On Tue, Dec 19, 2017 at 2:51 PM, Simone Tiraboschi  
> wrote:
>>
>>
>>
>> On Tue, Dec 19, 2017 at 12:56 PM, Martin Perina  wrote:
>>>
>>> As Irit mentioned, the provided reproduction steps are wrong (misuse of the
>>> code) and she posted a correct example showing that the jsonrpc code works
>>> as expected. So Martin/Simone, are you using the original example that
>>> misuses the client anywhere in the HE code?
>>
>>
>> According to
>> https://bugzilla.redhat.com/show_bug.cgi?id=1527155#c9
>> It works in Irit's example, at least on that host with that load and timings,
>> setting nr_retries=2 and _timeout=20
>>
>> While we have _timeout=5 and no custom nr_retries
>> https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L417
>>
>> So I think that we still have to fix it somehow.
>> Are we really sure that nr_retries=2 and _timeout=20 are really the magic
>> numbers that work under all conditions?
>
>
> No, it should be tested on HE environment and it depends on your usage.
>>
>>
>>>
>>>
>>> Thanks
>>>
>>> Martin
>>>
>>>
>>> On Tue, Dec 19, 2017 at 12:53 PM, Oved Ourfali  wrote:

 From the latest comment it doesn't seem like a blocker to me.
 Martin S. - your thoughts?

 On Tue, Dec 19, 2017 at 1:48 PM, Sandro Bonazzola  
 wrote:
>
> We have a proposed blocker for the release:
> Bug 1527155 (Infra / vdsm / Bindings-API, igoihman@redhat.com, NEW, urgent,
> ovirt-4.2.0): jsonrpc reconnect logic does not work and gets stuck
>
> Please review and either approve the blocker or postpone to 4.2.1.
> Thanks,
>


>>>
>>>
>>>
>>
>>
>
>
>


Re: [ovirt-devel] oVirt 4.2.0 GA status

2017-12-19 Thread Irit Goihman
On Tue, Dec 19, 2017 at 2:51 PM, Simone Tiraboschi 
wrote:

>
>
> On Tue, Dec 19, 2017 at 12:56 PM, Martin Perina 
> wrote:
>
>> As Irit mentioned, the provided reproduction steps are wrong (misuse of
>> the code) and she posted a correct example showing that the jsonrpc code
>> works as expected. So Martin/Simone, are you using the original example
>> that misuses the client anywhere in the HE code?
>>
>
> According to
> https://bugzilla.redhat.com/show_bug.cgi?id=1527155#c9
> It works in Irit's example, at least on that host with that load and
> timings, setting nr_retries=2 and _timeout=20
>
> While we have _timeout=5 and no custom nr_retries
> https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L417
>
> So I think that we still have to fix it somehow.
> Are we really sure that nr_retries=2 and _timeout=20 are really the magic
> numbers that work under all conditions?
>

No, it should be tested on HE environment and it depends on your usage.

>
>
>>
>> Thanks
>>
>> Martin
>>
>>
>> On Tue, Dec 19, 2017 at 12:53 PM, Oved Ourfali 
>> wrote:
>>
>>> From the latest comment it doesn't seem like a blocker to me.
>>> Martin S. - your thoughts?
>>>
>>> On Tue, Dec 19, 2017 at 1:48 PM, Sandro Bonazzola 
>>> wrote:
>>>
 We have a proposed blocker for the release:
 Bug 1527155 (Infra / vdsm / Bindings-API, igoih...@redhat.com, NEW, urgent,
 ovirt-4.2.0): jsonrpc reconnect logic does not work and gets stuck

 Please review and either approve the blocker or postpone to 4.2.1.
 Thanks,


>>>
>>>
>>
>>
>>
>
>


-- 

IRIT GOIHMAN

SOFTWARE ENGINEER

EMEA VIRTUALIZATION R&D

Red Hat EMEA 


TRIED. TESTED. TRUSTED. 


Re: [ovirt-devel] oVirt 4.2.0 GA status

2017-12-19 Thread Simone Tiraboschi
On Tue, Dec 19, 2017 at 12:56 PM, Martin Perina  wrote:

> As Irit mentioned, the provided reproduction steps are wrong (misuse of the
> code) and she posted a correct example showing that the jsonrpc code works
> as expected. So Martin/Simone, are you using the original example that
> misuses the client anywhere in the HE code?
>

According to
https://bugzilla.redhat.com/show_bug.cgi?id=1527155#c9
It works in Irit's example, at least on that host with that load and timings,
setting nr_retries=2 and _timeout=20

While we have _timeout=5 and no custom nr_retries
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/lib/util.py#L417

So I think that we still have to fix it somehow.
Are we really sure that nr_retries=2 and _timeout=20 are really the magic
numbers that work under all conditions?


>
> Thanks
>
> Martin
>
>
> On Tue, Dec 19, 2017 at 12:53 PM, Oved Ourfali 
> wrote:
>
>> From the latest comment it doesn't seem like a blocker to me.
>> Martin S. - your thoughts?
>>
>> On Tue, Dec 19, 2017 at 1:48 PM, Sandro Bonazzola 
>> wrote:
>>
>>> We have a proposed blocker for the release:
>>> Bug 1527155 (Infra / vdsm / Bindings-API, igoih...@redhat.com, NEW, urgent,
>>> ovirt-4.2.0): jsonrpc reconnect logic does not work and gets stuck
>>>
>>> Please review and either approve the blocker or postpone to 4.2.1.
>>> Thanks,
>>>
>>
>>
>
>
>

Re: [ovirt-devel] oVirt 4.2.0 GA status

2017-12-19 Thread Oved Ourfali
From the latest comment it doesn't seem like a blocker to me.
Martin S. - your thoughts?

On Tue, Dec 19, 2017 at 1:48 PM, Sandro Bonazzola 
wrote:

> We have a proposed blocker for the release:
> Bug 1527155 (Infra / vdsm / Bindings-API, igoih...@redhat.com, NEW, urgent,
> ovirt-4.2.0): jsonrpc reconnect logic does not work and gets stuck
>
> Please review and either approve the blocker or postpone to 4.2.1.
> Thanks,
>

[ovirt-devel] oVirt 4.2.0 GA status

2017-12-19 Thread Sandro Bonazzola
We have a proposed blocker for the release:
Bug 1527155 (Infra / vdsm / Bindings-API, igoih...@redhat.com, NEW, urgent,
ovirt-4.2.0): jsonrpc reconnect logic does not work and gets stuck

Please review and either approve the blocker or postpone to 4.2.1.
Thanks,


-- 

SANDRO BONAZZOLA

ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D

Red Hat EMEA 

TRIED. TESTED. TRUSTED. 

Re: [ovirt-devel] Migration failed

2017-12-19 Thread Michal Skrivanek

> On 19 Dec 2017, at 10:14, Arik Hadas  wrote:
> 
> 
> 
> On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek wrote:
> 
> > On 18 Dec 2017, at 13:21, Milan Zamazal wrote:
> >
> > Yedidyah Bar David writes:
> >
> >> On Mon, Dec 18, 2017 at 10:17 AM, Code Review wrote:
> >>> Jenkins CI posted comments on this change.
> >>>
> >>
> >>> View Change
> >>>
> >>> Patch set 3:Continuous-Integration -1
> >>>
> >>> Build Failed
> >>>
> >>> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/
> >>> : FAILURE
> >>
> >> Console output of above job says:
> >>
> >> 08:13:34   # migrate_vm:
> >> 08:16:37 * Collect artifacts:
> >> 08:16:40 * Collect artifacts: Success (in 0:00:03)
> >> 08:16:40   # migrate_vm: Success (in 0:03:06)
> >> 08:16:40   # Results located at
> >> /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml
> >> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50)
> >> 08:16:40 Error occured, aborting
> >>
> >> The file 006_migrations.py.junit.xml [1] says:
> >>
> >> 
> >
> > Reading the logs, I can see the VM migrates normally and seems to be
> > reported to Engine correctly.  When Engine receives end-of-migration
> > event, it sends Destroy to the source (which is correct), calls dumpxmls
> > on the destination in the meantime (looks fine to me) and then calls
> 
> looks like a race between getallvmstats reporting the VM as Down (statusTime:
> 4296271980) being processed, while a Down/MigrationSucceeded event is
> arriving (with notify_time 4296273170) at about the same time.
> Unfortunately the vdsm.log is not at DEBUG level, so there's very little
> information as to why and what exactly it sent out.
> @infra - can you enable debug log level for vdsm by default?
> 
> It does look like a race to me - does it reproduce? 
> 
> > Destroy on the destination, which is weird and I don't understand why
> > the Destroy is invoked.
> >
> > Arik, would you like to take a look?  Maybe I overlooked something or
> > maybe there's a bug.  The logs are at
> > http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/
> > and the interesting things happen around 2017-12-18 03:13:43,758-05.
> 
> So it looks like that:
> 1. the engine polls the VMs from the source host
> 2. right after #1 we get the down event with the proper exit reason (= migration
> succeeded) but the engine doesn't process it since the VM is being locked by
> the monitoring as part of processing that polling (to prevent two analyses of
> the same VM running simultaneously).
> 3. the result of the polling is a VM in status Down and most probably
> exit_status=Normal
> 4. the engine decides to abort the migration and thus the monitoring thread 
> of the source host destroys the VM on the destination host.
> 
> Unfortunately we don't have the exit_reason that is returned by the polling.
> However, the only option I can think of is that it is different from
> MigrationSucceeded, because otherwise we would have handed the VM over to the
> destination host rather than aborting the migration [1].
> That part of the code recently changed as part of [2] - we used to hand over
> the VM when we got from the source host:
> status = Down + exit_status = Normal 
> And in the database: previous_status = MigrationFrom
> But after that change we require:
> status = Down + exit_status = Normal ** + exit_reason = MigrationSucceeded **
> And in the database: previous_status = MigrationFrom
> 
> Long story short, is it possible that VDSM had set the status of the VM to 
> Down and exit_status to Normal but the exit_reason was not updated (yet?) to 
> MigrationSucceeded?

ok, so there might be a plausible explanation
the guest drive mapping introduced a significant delay into the VM.getStats
call since it tries to update the mapping when it detects a change. That is
likely to happen on lifecycle changes. In the OST case it took 1.2s to finish
the whole call, and in the meantime the migration had finished. The getStats()
call is not written with possible state changes in mind, so if the state moves
from anything to Down in the middle of the call, it returns a Down state
without exitCode and exitReason, which confuses the engine. We started to use
the exitReason code to differentiate the various flavors of Down in the engine
in ~4.1 
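To make that concrete, here is a hedged sketch of the safer pattern (the class
and method names are hypothetical, not vdsm's actual code): the Down stats
should be built from one atomic snapshot of status plus exit details, instead
of re-reading the live status after the slow drive-mapping update:

    # Hypothetical illustration of the fix -- not vdsm's actual code.
    import threading

    class FakeVM(object):
        def __init__(self):
            self._lock = threading.Lock()
            self.status = 'Migration Source'
            self.exit_code = None       # set together with status = 'Down'
            self.exit_reason = None

        def snapshot(self):
            # Atomic read: status and exit details come from the same
            # moment, so 'Down' is never observed without its reason.
            with self._lock:
                return self.status, self.exit_code, self.exit_reason

    def get_stats(vm):
        status, exit_code, exit_reason = vm.snapshot()
        stats = {'status': status}
        if status == 'Down':
            stats['exitCode'] = exit_code
            stats['exitReason'] = exit_reason   # e.g. MigrationSucceeded
        return stats

The racy variant reads the status at the start of the call, spends ~1.2s in
the drive-mapping update, and then consults the live state again, which is how
a Down without exitCode/exitReason can leak out to the engine.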

Re: [ovirt-devel] oVirt System Test configuration

2017-12-19 Thread Eyal Edri
We recently added a host to the upgrade suite [1], so I guess it shouldn't
be hard to include upgrading the host there as well?

[1] https://gerrit.ovirt.org/#/c/80687/

On Mon, Dec 18, 2017 at 3:41 PM, Martin Perina  wrote:

>
>
> On Mon, Dec 18, 2017 at 2:23 PM, Eyal Edri  wrote:
>
>>
>>
>> On Mon, Dec 18, 2017 at 3:20 PM, Martin Perina 
>> wrote:
>>
>>>
>>>
>>> On Mon, Dec 18, 2017 at 2:13 PM, Martin Sivak  wrote:
>>>
 The engine only updates a short list of packages during host deploy, if
 I remember correctly.

 See: https://gerrit.ovirt.org/#/c/59897/

 Martin

>>>
>>> No longer true in 4.2: https://bugzilla.redhat.com/show_bug.cgi?id=1380498
>>>
>>
>> Does it also apply to new host installations or just upgrades?
>>
>
> No, only upgrades using Host Upgrade Manager in webadmin or RESTAPI
>
>
>>
>>
>>>
>>> ​
>>>
>>>

 On Mon, Dec 18, 2017 at 2:01 PM, Sandro Bonazzola 
 wrote:

>
>
> 2017-12-18 13:57 GMT+01:00 Eyal Edri :
>
>>
>>
>> On Mon, Dec 18, 2017 at 2:53 PM, Sandro Bonazzola <
>> sbona...@redhat.com> wrote:
>>
>>>
>>>
>>> 2017-12-18 12:42 GMT+01:00 Yaniv Kaul :
>>>


 On Mon, Dec 18, 2017 at 12:43 PM, Sandro Bonazzola <
 sbona...@redhat.com> wrote:

> Hi, I'd like to discuss what's being tested by oVirt System Test.
>
> I'm investigating on a sanlock issue that affects hosted engine hc
> suite.
> I installed a CentOS minimal VM and set repositories as in
> http://jenkins.ovirt.org/job/ovirt-system-tests_hc-basic-suite-master/128/artifact/exported-artifacts/reposync-config.repo
>
> Upgrade from CentOS 1708 (7.4) minimal is:
>
> Updating:
>  bind-libs-lite           x86_64  32:9.9.4-51.el7_4.1  centos-updates-el7  733 k
>  bind-license             noarch  32:9.9.4-51.el7_4.1  centos-updates-el7   84 k
>  nss                      x86_64  3.28.4-15.el7_4      centos-updates-el7  849 k
>  nss-softokn              x86_64  3.28.3-8.el7_4       centos-updates-el7  310 k
>  nss-softokn-freebl       x86_64  3.28.3-8.el7_4       centos-updates-el7  214 k
>  nss-sysinit              x86_64  3.28.4-15.el7_4      centos-updates-el7   60 k
>  nss-tools                x86_64  3.28.4-15.el7_4      centos-updates-el7  501 k
>  selinux-policy           noarch  3.13.1-166.el7_4.7   centos-updates-el7  437 k
>  selinux-policy-targeted  noarch  3.13.1-166.el7_4.7   centos-updates-el7  6.5 M
>  systemd                  x86_64  219-42.el7_4.4       centos-updates-el7  5.2 M
>  systemd-libs             x86_64  219-42.el7_4.4       centos-updates-el7  376 k
>  systemd-sysv             x86_64  219-42.el7_4.4       centos-updates-el7   70 k
>
> Enabling the CentOS repos:
>
>  grub2        x86_64  1:2.02-0.65.el7.centos.2  updates  29 k
>      replacing grub2.x86_64 1:2.02-0.64.el7.centos
>  grub2-tools  x86_64  1:2.02-0.65.el7.centos.2

Re: [ovirt-devel] Migration failed

2017-12-19 Thread Arik Hadas
On Tue, Dec 19, 2017 at 12:20 AM, Michal Skrivanek <
michal.skriva...@redhat.com> wrote:

>
> > On 18 Dec 2017, at 13:21, Milan Zamazal  wrote:
> >
> > Yedidyah Bar David  writes:
> >
> >> On Mon, Dec 18, 2017 at 10:17 AM, Code Review  wrote:
> >>> Jenkins CI posted comments on this change.
> >>>
> >>
> >>> View Change
> >>>
> >>> Patch set 3:Continuous-Integration -1
> >>>
> >>> Build Failed
> >>>
> >>> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/
> >>> : FAILURE
> >>
> >> Console output of above job says:
> >>
> >> 08:13:34   # migrate_vm:
> >> 08:16:37 * Collect artifacts:
> >> 08:16:40 * Collect artifacts: Success (in 0:00:03)
> >> 08:16:40   # migrate_vm: Success (in 0:03:06)
> >> 08:16:40   # Results located at
> >> /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml
> >> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50)
> >> 08:16:40 Error occured, aborting
> >>
> >> The file 006_migrations.py.junit.xml [1] says:
> >>
> >> 
> >
> > Reading the logs, I can see the VM migrates normally and seems to be
> > reported to Engine correctly.  When Engine receives end-of-migration
> > event, it sends Destroy to the source (which is correct), calls dumpxmls
> > on the destination in the meantime (looks fine to me) and then calls
>
> looks like a race between getallvmstats reporting the VM as Down (statusTime:
> 4296271980) being processed, while a Down/MigrationSucceeded event is
> arriving (with notify_time 4296273170) at about the same time.
> Unfortunately the vdsm.log is not at DEBUG level, so there's very little
> information as to why and what exactly it sent out.
> @infra - can you enable debug log level for vdsm by default?


> It does look like a race to me - does it reproduce?


> > Destroy on the destination, which is weird and I don't understand why
> > the Destroy is invoked.
> >
> > Arik, would you like to take a look?  Maybe I overlooked something or
> > maybe there's a bug.  The logs are at
> > http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/
> > and the interesting things happen around 2017-12-18 03:13:43,758-05.
>

So it looks like that:
1. the engine polls the VMs from the source host
2. right after #1 we get the down event with the proper exit reason (=
migration succeeded) but the engine doesn't process it since the VM is
being locked by the monitoring as part of processing that polling (to
prevent two analyses of the same VM running simultaneously).
3. the result of the polling is a VM in status Down and most probably
exit_status=Normal
4. the engine decides to abort the migration and thus the monitoring thread
of the source host destroys the VM on the destination host.

Unfortunately we don't have the exit_reason that is returned by the polling.
However, the only option I can think of is that it is different from
MigrationSucceeded, because otherwise we would have handed the VM over to the
destination host rather than aborting the migration [1].
That part of the code recently changed as part of [2] - we used to
hand over the VM when we got from the source host:
status = Down + exit_status = Normal
And in the database: previous_status = MigrationFrom
But after that change we require:
status = Down + exit_status = Normal ** + exit_reason = MigrationSucceeded
**
And in the database: previous_status = MigrationFrom

Long story short, is it possible that VDSM had set the status of the VM to
Down and exit_status to Normal but the exit_reason was not updated (yet?)
to MigrationSucceeded?

[1]
https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnalyzer.java#L291
[2] https://gerrit.ovirt.org/#/c/84387/
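The condition change described above can be paraphrased as follows (the real
code in VmAnalyzer [1] is Java; this Python-style sketch is only illustrative,
not the actual engine code):

    # Hypothetical paraphrase of the hand-over check -- names are illustrative.
    def should_hand_over(reported, db_vm):
        common = (reported.status == 'Down'
                  and reported.exit_status == 'Normal'
                  and db_vm.previous_status == 'MigrationFrom')
        # before [2]: 'common' alone was enough to hand the VM over;
        # after [2]: the exit reason must also be explicit, so a Down
        # reported without its exit details aborts the migration instead.
        return common and reported.exit_reason == 'MigrationSucceeded'

If VM.getStats() can indeed return Down with the exit_reason unset, this
stricter check is exactly where the hand-over would be skipped.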


> >
> >> Can someone please have a look? Thanks.
> >>
> >> As a side note, if indeed this is the cause for the failure for the
> >> job, it's misleading to say "migrate_vm: Success".
> >>
> >> [1]
> >> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migrations.py.junit.xml
> >>
> >>>
> >>> To view, visit change 85177. To unsubscribe, visit settings.
> >>>
> >>> Gerrit-Project: ovirt-system-tests
> >>> Gerrit-Branch: master
> >>> Gerrit-MessageType: comment
> >>> Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44be22dd590b5
> >>> Gerrit-Change-Number: 85177
> >>> Gerrit-PatchSet: 3
> >>> Gerrit-Owner: Yedidyah Bar David 
> >>> Gerrit-Reviewer: Dafna Ron 
> >>> Gerrit-Reviewer: Eyal Edri 
> >>> Gerrit-Reviewer: Jenkins CI
> >>> Gerrit-Reviewer: Sandro Bonazzola 
> >>> Gerrit-Reviewer: Yedidyah Bar David 
> >>> Gerrit-Comment-Date: Mon, 18 Dec 

Re: [ovirt-devel] Migration failed

2017-12-19 Thread Eyal Edri
On Tue, Dec 19, 2017 at 9:22 AM, Michal Skrivanek <
michal.skriva...@redhat.com> wrote:

>
> On 19 Dec 2017, at 07:53, Eyal Edri  wrote:
>
>
>
> On Dec 19, 2017 00:20, "Michal Skrivanek" 
> wrote:
>
>
> > On 18 Dec 2017, at 13:21, Milan Zamazal  wrote:
> >
> > Yedidyah Bar David  writes:
> >
> >> On Mon, Dec 18, 2017 at 10:17 AM, Code Review  wrote:
> >>> Jenkins CI posted comments on this change.
> >>>
> >>
> >>> View Change
> >>>
> >>> Patch set 3:Continuous-Integration -1
> >>>
> >>> Build Failed
> >>>
> >>> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/
> >>> : FAILURE
> >>
> >> Console output of above job says:
> >>
> >> 08:13:34   # migrate_vm:
> >> 08:16:37 * Collect artifacts:
> >> 08:16:40 * Collect artifacts: Success (in 0:00:03)
> >> 08:16:40   # migrate_vm: Success (in 0:03:06)
> >> 08:16:40   # Results located at
> >> /dev/shm/ost/deployment-basic-suite-master/default/006_migrations.py.junit.xml
> >> 08:16:40 @ Run test: 006_migrations.py: Success (in 0:03:50)
> >> 08:16:40 Error occured, aborting
> >>
> >> The file 006_migrations.py.junit.xml [1] says:
> >>
> >> 
> >
> > Reading the logs, I can see the VM migrates normally and seems to be
> > reported to Engine correctly.  When Engine receives end-of-migration
> > event, it sends Destroy to the source (which is correct), calls dumpxmls
> > on the destination in the meantime (looks fine to me) and then calls
>
> looks like a race between getallvmstats reporting the VM as Down (statusTime:
> 4296271980) being processed, while a Down/MigrationSucceeded event is
> arriving (with notify_time 4296273170) at about the same time.
> Unfortunately the vdsm.log is not at DEBUG level, so there's very little
> information as to why and what exactly it sent out.
> @infra - can you enable debug log level for vdsm by default?
>
>
> How do you enable debug mode for vdsm?
>
>
> It can be permanently set in /etc/vdsm/logger.conf.
> I suppose enabling it for all loggers would be useful.
>
> see https://access.redhat.com/articles/2919931
>


Since this file is probably created only after the host is added, I assume it
can be set only after the add-host test, and not in the deployment scripts.
So it might need to be added as a test or a step after the 'verify add host'
step.

I'll open a ticket on it and we'll see who can assist with it.
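For reference, /etc/vdsm/logger.conf follows Python's logging fileConfig
format, so raising vdsm to DEBUG could look roughly like the snippet below.
The section names are an assumption and can differ between versions (see the
article linked above), and a vdsmd restart is needed afterwards:

    # Assumed excerpt of /etc/vdsm/logger.conf -- verify section names locally.
    [logger_root]
    level=DEBUG
    handlers=syslog,logfile

    [logger_vds]
    level=DEBUG
    handlers=syslog,logfile
    qualname=vds
    propagate=0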



>
> Thanks,
> michal
>
>
>
> It does look like a race to me - does it reproduce?
>
> > Destroy on the destination, which is weird and I don't understand why
> > the Destroy is invoked.
> >
> > Arik, would you like to take a look?  Maybe I overlooked something or
> > maybe there's a bug.  The logs are at
> > http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/test_logs/basic-suite-master/post-006_migrations.py/
> > and the interesting things happen around 2017-12-18 03:13:43,758-05.
> >
> >> Can someone please have a look? Thanks.
> >>
> >> As a side note, if indeed this is the cause for the failure for the
> >> job, it's misleading to say "migrate_vm: Success".
> >>
> >> [1]
> >> http://jenkins.ovirt.org/job/ovirt-system-tests_master_check-patch-el7-x86_64/2882/artifact/exported-artifacts/basic-suite-master__logs/006_migrations.py.junit.xml
> >>
> >>>
> >>> To view, visit change 85177. To unsubscribe, visit settings.
> >>>
> >>> Gerrit-Project: ovirt-system-tests
> >>> Gerrit-Branch: master
> >>> Gerrit-MessageType: comment
> >>> Gerrit-Change-Id: I7eb386744a2a2faf0acd734e0ba44be22dd590b5
> >>> Gerrit-Change-Number: 85177
> >>> Gerrit-PatchSet: 3
> >>> Gerrit-Owner: Yedidyah Bar David 
> >>> Gerrit-Reviewer: Dafna Ron 
> >>> Gerrit-Reviewer: Eyal Edri 
> >>> Gerrit-Reviewer: Jenkins CI
> >>> Gerrit-Reviewer: Sandro Bonazzola 
> >>> Gerrit-Reviewer: Yedidyah Bar David 
> >>> Gerrit-Comment-Date: Mon, 18 Dec 2017 08:17:11 +
> >>> Gerrit-HasComments: No
> >
>
>
>
>


-- 

Eyal edri


MANAGER

RHV DevOps

EMEA VIRTUALIZATION R&D


Red Hat EMEA 
 TRIED. TESTED. TRUSTED. 
phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)