[ovirt-devel] Re: "env issues" in CI (was: virt-sparsify failed (was: [oVirt Jenkins] ovirt-system-tests_basic-suite-master_nightly - Build # 479 - Failure!))

Michal Skrivanek Thu, 15 Oct 2020 04:02:11 -0700


> On 15 Oct 2020, at 12:16, Yedidyah Bar David <d...@redhat.com> wrote:
> 
> On Thu, Oct 15, 2020 at 12:44 PM Michal Skrivanek <mskri...@redhat.com> wrote:
>> 
>> 
>> 
>>> On 14 Oct 2020, at 08:14, Yedidyah Bar David <d...@redhat.com> wrote:
>>> 
>>> On Tue, Oct 13, 2020 at 6:46 PM Nir Soffer <nsof...@redhat.com> wrote:
>>>> 
>>>> On Mon, Oct 12, 2020 at 9:05 AM Yedidyah Bar David <d...@redhat.com> wrote:
>>>>> The next run of the job (480) did finish successfully. No idea if it
>>>>> was already fixed by a patch, or is simply a random/env issue.
>>>> 
>>>> I think this is an env issue; we run on overloaded VMs with a small amount
>>>> of memory.
>>>> I have seen such random failures before.
>>> 
>>> Generally speaking, I think we must aim for zero failures due to "env
>>> issues" - and not ignore them as such.
>> 
>> Exactly. We cannot ignore that any longer.
>> 
>>> It would obviously be nice if we had more hardware in CI, no doubt.
>> 
>> there’s never enough
>> 
>>> But I wonder if perhaps stressing the system like we do (due to resource
>>> scarcity) is actually a good thing - that it helps us find bugs that real
>>> users might also run into in perfectly legitimate scenarios
>> 
>> yes, it absolutely does.
>> 
>>> - meaning, using
>>> what we recommend in terms of hardware etc. but with a load that is higher
>>> than what we have in CI per-run - as, admittedly, we only have minimal
>>> _data_ there.
>>> 
>>> So: If we decide that some code "worked as designed" and failed due to
>>> "env issue", I still think we should fix this - either in our code, or
>>> in CI.
>> 
>> yes!
> 
> Or, as applicable e.g. to the current case, if we can't reproduce, at least
> add more logging so that the next (random) reproduction reveals more
> information...
> 
>> 
>>> 
>>> For the latter, I do not think it makes sense to just say "the machines are
>>> overloaded and do not have enough memory" - we must come up with concrete
>>> details - e.g. "We need at least X MiB RAM".
>> 
>> I’ve spent quite some time analyzing the flakes in the basic suite this past
>> half year… so allow me to say that that’s usually just an excuse for a lousy
>> test (or functionality :)
>> 
>>> 
>>> For the current issue, if we are certain that this is due to low mem, it's
>>> quite easy to e.g. revert this patch:
>>> 
>>> https://gerrit.ovirt.org/110530
>>> 
>>> Obviously it will mean either longer queues or over-committing (higher
>>> load). Not sure which.
>> 
>> It’s difficult to pinpoint the reason really. If it’s happening rarely (as
>> this one is) you’d need a statistically relevant comparison. Which takes
>> time…
>> 
>> About this specific sparsify test - it was me who uncommented it a few months
>> ago, after running around 100 tests over a weekend. It may have failed once
>> (there were/are still some other flakes)… but considering that the overall
>> success rate was quite low at that time, that sounded acceptable to me.
> 
> ... E.g. in the current case (I wasn't aware of the above), if it fails for
> you even once, and you can't find the root cause, perhaps better to make
> sure to log more information, so that the next case will be more likely
> to help us find the root cause. Or open a bug for that, if you do not
> do this immediately.
> 
>> If this is now happening more often then it does sound like a regression
>> somewhere. Could be all the OST changes or test rearrangements, but it also
>> could be a code regression.
> 
> I have no idea if it happens more often. I think I only noticed this once.
> 
>> 
>> Either way it’s supposed to be predictable.
> 
> Really? The failure of virt-sparsify? So perhaps reply on the other
> thread explaining how to reproduce, or even fix :-)


Sadly, no :) It's the environment that is supposed to be predictable. And
that’s why we’re moving towards that. ost-images gives complete software
isolation (all repos are disabled); the only thing it still does over the
network is download the relatively small cirros image from glance.ovirt.org.
And running OST on bare metal isolates you from possible concurrent runs in
the CI env.

As for a follow-up: it’s a virt test, so adding Arik.

> 
>> And it is, just not in the environment we use for this particular job -
>> it’s the old one without ost-images, inside the troublesome mock, so you
>> don’t know what it picked up or what the real underlying system is (outside
>> of mock).
>> 
>> Thanks,
>> michal
>> 
>>> 
>>> But personally, I wouldn't do that without knowing more (e.g. following
>>> the other thread).
>>> 
>>> Best regards,
>>> --
>>> Didi
>>> 
>> 
> 
> 
> --
> Didi
> 
