----- Original Message -----
> From: "David Caro" <[email protected]>
> To: "Infra" <[email protected]>
> Sent: Tuesday, February 3, 2015 10:50:14 PM
> Subject: Re: Major outage
>
> Good news!
>
> I got a vm working in the recently-upgraded fc20 host ovirt-srv02. The issue
> with the vms seems to be that the default value for the numa setting is not
> behaving correctly with libvirt. The fc19 vms just show the input/output
> error, but the fc20 one also shows the full libvirt string, and there you can
> see that it complains about numa:
>
> libvirtError: internal error: Process exited prior to exec: libvirt: error :
> internal error: NUMA memory tuning in 'preferred' mode only supports single
> node
>
> So what I've done is edit the vm, pin it to a node, set it as not migratable
> (or whatever the spelling is) and changed the numa mode from preferred to
> strict. Saved, and then edited the vm again, reverting the host pin and the
> migration settings, but not changing the numa ones. That allowed me to boot
> one of the vms so far (just tested).
>
> Some ugly issues:
>
> The known multipathd message in the logs... it's quite annoying and fills up
> the logs.
> Vdsm messed up the network a couple of times: once it removed all the ifcfg
> files, and the other time it restored old values in the rules/route files.
> Vdsm failed on vdsm-restore-net-config:89 with a non-existing key exception
> instead of just showing an error message and continuing execution.
>
> I'll triage the above errors tomorrow and resend to the devels; for now I'm
> just sending this to avoid forgetting about them.
>
> Will continue booting the rest of the production vms, do some simple sanity
> checks and leave the rest for tomorrow.
>
> On the good side, we now have one fc20 host in each cluster, and 3.5 on all
> the production DC hosts! yay \o/
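
For reference, the 'preferred' vs 'strict' setting corresponds to the <numatune>
element in the libvirt domain XML, and 'preferred' mode only accepts a single
NUMA node, which is exactly what the error above complains about. A minimal
sketch of what the change amounts to at that level (the domain XML and names
below are hypothetical, not taken from the actual hosts):

    # Illustration only: the "preferred" -> "strict" switch as it would look
    # in a libvirt domain XML (hypothetical example, not the real vm config).
    import xml.etree.ElementTree as ET

    domain_xml = """
    <domain type='kvm'>
      <name>example-vm</name>
      <numatune>
        <memory mode='preferred' nodeset='0-1'/>
      </numatune>
    </domain>
    """

    root = ET.fromstring(domain_xml)
    memory = root.find('./numatune/memory')
    # 'preferred' combined with a multi-node nodeset is what libvirt rejects
    print('before:', memory.attrib)

    memory.set('mode', 'strict')   # 'strict' allows a multi-node nodeset
    print('after: ', memory.attrib)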
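
On the vdsm-restore-net-config failure: the complaint is just that a missing key
should be logged and skipped rather than abort the whole restore. A rough sketch
of that "log and continue" pattern (the dict layout and key names here are made
up for illustration, not vdsm's actual code):

    # Hypothetical sketch: tolerate a missing key instead of raising KeyError.
    import logging

    logging.basicConfig(level=logging.INFO)

    def restore_networks(persisted_nets):
        for name, attrs in persisted_nets.items():
            nic = attrs.get('nic')   # .get() instead of attrs['nic']
            if nic is None:
                logging.error("network %s has no nic defined, skipping", name)
                continue             # keep going with the remaining networks
            logging.info("restoring network %s on nic %s", name, nic)

    restore_networks({
        'ovirtmgmt': {'nic': 'em1'},
        'broken-net': {},            # would have raised KeyError with attrs['nic']
    })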
Great news! Adding some NUMA experts to see if they have any advice on
optimizing it on the DC.
e.

> If anything comes up again I'll update in this thread; if not, tomorrow
> morning I'll update when all the environment is working 100%.
>
> PS: Thanks Fabian and Max!!
>
> On 02/03, David Caro wrote:
> >
> > New update,
> >
> > Host srv01 is up and running, but 02 and 03 have issues, they can't start
> > up any vms.
> >
> > The error is in libvirt:
> >
> > libvirtError: Child quit during startup handshake: Input/output error
> >
> > Looking around I saw a thread in the users list that fixed it with:
> >
> > /usr/lib/systemd/systemd-vdsmd reconfigure force
> >
> > That worked on srv01, but the others did not. So I'm trying to upgrade one
> > of them, srv02, to fc20, hoping the newer libvirt version will not have
> > that issue.
> >
> > Those two hosts are the ones in the production data center, which has the
> > foreman vm, so none of the slaves is working properly until that is solved.
> >
> > Will update in ~one hour or when the problem is solved.
> >
> > Being so late, if I get the production vms running on one host, I'll leave
> > the rest for tomorrow.
> >
> > D
> >
> > On 02/03, David Caro wrote:
> > >
> > > Ok, update:
> > >
> > > Not all the servers have been restored; most of the slave vms are up, and
> > > all but one host are up.
> > >
> > > Engine - OK
> > > storage - OK
> > > storage01 - OK
> > > storage02 - OK
> > > srv01 - DOWN
> > > srv02 - OUT OF THE POOL (will add when 01 is up)
> > > srv03 - OK
> > > srv04 - OK
> > > srv05 - OK
> > > srv06 - OK
> > > srv07 - OK
> > > srv08 - OK
> > >
> > > If you need any specific vm I can try to get it up on one of the running
> > > hosts, but I'd wait until the last host is up to start all of them.
> > >
> > > Will update again when finished, or in one hour.
> > >
> > > On 02/03, David Caro wrote:
> > > >
> > > > We are having a major outage on the phoenix lab, don't expect any
> > > > vms/slaves to be properly working yet.
> > > >
> > > > Will update when solved, or in an hour with the status.
> >
> --
> David Caro
>
> Red Hat S.L.
> Continuous Integration Engineer - EMEA ENG Virtualization R&D
>
> Tel.: +420 532 294 605
> Email: [email protected]
> Web: www.redhat.com
> RHT Global #: 82-62605
>
> _______________________________________________
> Infra mailing list
> [email protected]
> http://lists.ovirt.org/mailman/listinfo/infra

_______________________________________________
Infra mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/infra
