On Sun, Jan 21, 2018 at 1:01 PM, Barak Korren <[email protected]> wrote:
> On 21 January 2018 at 12:50, Eyal Edri <[email protected]> wrote:
>
>> On Sun, Jan 21, 2018 at 12:47 PM, Barak Korren <[email protected]>
>> wrote:
>>
>>> On 21 January 2018 at 12:39, Eyal Edri <[email protected]> wrote:
>>>
>>>> There is another issue, which is currently failing all CQ, and it's
>>>> related to the new IBRS CPU model.
>>>> It looks like all of the Lago slaves were upgraded to the new Libvirt
>>>> and kernel on Friday, while we still don't have a fix in
>>>> lago-ost-plugin for that.
>>>>
>>>> I think there was a misunderstanding about what to upgrade, and it
>>>> might have been understood that only the BIOS upgrade breaks it and
>>>> not the kernel one.
>>>>
>>>> In any case, we're currently fixing the issue, either by downgrading
>>>> the relevant pkgs on the Lago slaves or by adding the mapping to the
>>>> new CPU types from OST.
>>>>
>>>> For the future, I suggest a few updates to maintenance work on Jenkins
>>>> slaves (VMs or BM):
>>>>
>>>> 1. Let's avoid doing an upgrade close to a weekend (i.e. not on
>>>>    Thu-Sun), so the whole team can be around to help if needed or if
>>>>    something unexpected happens.
>>>> 2. When we have a system-wide upgrade scheduled, like all BM slaves or
>>>>    all VMs for a specific OS, let's adopt a gradual upgrade with a
>>>>    window of a few days in between.
>>>>    E.g., if we need to upgrade all Lago slaves, let's upgrade 1-2,
>>>>    wait to see that nothing breaks, and continue after we verify that
>>>>    OST runs (either by seeing it on CQ or by running it manually).
>>>>
>>>> Thoughts?
>>>>
>>> We have a staging system - we should be using it for staging....
>>>
>>
>> Do we have OST tests or a manual job available there?
>>
>
> We can add them easily, or simply run Lago manually when needed.
>
>> In any case, this doesn't contradict what I suggested. Even if you test
>> on staging, there could be differences from the production system, so we
>> should take care when we upgrade regardless.
>>
>
> Yes, but at least we'd know we green-lighted the new configuration - I'm
> sure in this case we could have found at least some of the issues on
> staging (like the fc27 issues, for example) and could have avoided
> expensive production failures.
>
>> Another point when scheduling an upgrade is to talk to the infra owner
>> or the CI team and understand if we currently have a large queue in CQ
>> or known failures, in which case it might be best to wait a bit until
>> it's cleared.
>>
>

Adding infra-support so we can gather this info and prepare a
maintenance / upgrade checklist to add to the oVirt infra docs.
Let's continue the discussion and suggestions on that ticket.

> --
> Barak Korren
> RHV DevOps team, RHCE, RHCi
> Red Hat EMEA
> redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
>

--
Eyal edri

MANAGER
RHV DevOps
EMEA VIRTUALIZATION R&D

Red Hat EMEA <https://www.redhat.com/>
<https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)
_______________________________________________
Infra mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/infra
