Re: Change-queue job failures this weekend

Eyal Edri Sun, 21 Jan 2018 03:11:06 -0800

On Sun, Jan 21, 2018 at 1:01 PM, Barak Korren <[email protected]> wrote:


>
>
> On 21 January 2018 at 12:50, Eyal Edri <[email protected]> wrote:
>
>>
>>
>> On Sun, Jan 21, 2018 at 12:47 PM, Barak Korren <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On 21 January 2018 at 12:39, Eyal Edri <[email protected]> wrote:
>>>
>>>> There is another issue, which is currently failing all CQ, and it's
>>>> related to the new IBRS CPU model.
>>>> It looks like all of the Lago slaves were upgraded to a new libvirt and
>>>> kernel on Friday, while we still don't have a fix in lago-ost-plugin for
>>>> that.
>>>>
>>>> I think there was a misunderstanding about what to upgrade, and it
>>>> might have been understood that only the BIOS upgrade breaks it and not the
>>>> kernel one.
>>>>
>>>> In any case, we're currently fixing the issue, either by downgrading
>>>> the relevant packages on Lago slaves or adding the mapping to new CPU types
>>>> from OST.
>>>>
>>>> For the future, I suggest a few updates to maintenance work on Jenkins
>>>> slaves (VMs or BM):
>>>>
>>>> 1. Let's avoid doing an upgrade close to a weekend (i.e. not on Thu-Sun),
>>>> so the whole team can be around to help if needed or if something
>>>> unexpected happens.
>>>> 2. When we have a system-wide upgrade scheduled, like all BM slaves or
>>>> VMs for a specific OS, let's adopt a gradual upgrade with a window of a
>>>> few days in between,
>>>>   e.g., if we need to upgrade all Lago slaves, let's upgrade 1-2, wait
>>>> to see that nothing breaks, and continue after we verify that OST runs
>>>> (either by watching CQ or by running it manually).
>>>>
>>>>
>>>> Thoughts?
>>>>
>>>>
>>> We have a staging system - we should be using it for staging....
>>>
>>
>> Do we have OST tests or a manual job available there?
>>
>
> We can add them easily, or simply run Lago manually when needed.
>
>
>> In any case, this doesn't contradict what I suggested. Even if you test
>> on staging, there could be differences from the production system, so we
>> should take care when we upgrade regardless.
>>
>
> Yes, but at least we'd know we green-lighted the new configuration - I'm
> sure in this case we could have found at least some of the issues on
> staging (like the fc27 issues, for example) and could have avoided expensive
> production failures.
>
> Another point when scheduling an upgrade is to talk to the infra owner or the
>> CI team and understand whether we currently have a large queue in CQ or
>> known failures, so it might be best to wait a bit until it's cleared.
>>
>>
>
>

Adding infra-support so we can gather this info and prepare a maintenance
/ upgrade checklist to add to the oVirt infra docs.
Let's continue the discussion and suggestions on that ticket.




> --
> Barak Korren
> RHV DevOps team , RHCE, RHCi
> Red Hat EMEA
> redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
>



-- 

Eyal Edri


MANAGER

RHV DevOps

EMEA VIRTUALIZATION R&D


Red Hat EMEA <https://www.redhat.com/>
<https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)

_______________________________________________
Infra mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/infra
