On Wed, Feb 10, 2016 at 11:49 AM, Eyal Edri <[email protected]> wrote:
> Great root call analysis!
> s/call/cause :)
>
> Maybe we should add something like 'outage' guidelines on the wiki or
> readthedocs for any infra member that is about to do something that
> might affect the DC?
>
> Probably an email to the infra/devel list should be OK, or even
> emailing [email protected] so there will be an open ticket.
>
> thoughts?
>
> E.
>
> On Tue, Feb 9, 2016 at 4:18 PM, David Caro <[email protected]> wrote:
>
>> On 02/08 20:28, David Caro wrote:
>> >
>> > Hi everyone!
>> >
>> > There has been a storage outage today; it started around 17:30 CEST
>> > and lasted until ~20:15. All the services are back up and running
>> > now, but a bunch of Jenkins jobs failed due to the outage (all the
>> > slaves are using that storage), so you might see some false
>> > positives in your CI runs. To retrigger you can use this job:
>> >
>> > http://jenkins.ovirt.org/gerrit_manual_trigger/
>> >
>> > and/or submit a new patchset (rebasing should work). In any case,
>> > if you have any issues or doubts, please respond to this email or
>> > ping me (dcaro/dcaroest) on IRC.
>> >
>> > Sorry for the inconvenience; we are gathering logs to find out what
>> > happened and prevent it from happening in the future.
>>
>> So the source of the issue has been sorted out: an uncoordinated
>> effort ended up changing the LACP settings on the switches for all
>> the hosts, which caused a global network outage (all the hosts were
>> affected). That in turn caused the clustering to freeze: as neither
>> node was able to reach the network, both went down.
>>
>> Then, once the network came up, the master of the cluster tried to
>> remount the DRBD storage but was unable to, due to some process
>> keeping it busy, and did not fully start up.
>> That is a scenario that we did not test (we tested one node failing,
>> not both), so we will have to investigate that failure case and find
>> a solution for the clustering.
>>
>> We are also talking with the hosting provider to properly sync with
>> us on that type of intervention, so this will not happen again.
>>
>> Thanks for your patience.
>>
>> > --
>> > David Caro
>> >
>> > Red Hat S.L.
>> > Continuous Integration Engineer - EMEA ENG Virtualization R&D
>> >
>> > Tel.: +420 532 294 605
>> > Email: [email protected]
>> > IRC: dcaro|dcaroest@{freenode|oftc|redhat}
>> > Web: www.redhat.com
>> > RHT Global #: 82-62605
>>
>> --
>> David Caro
>>
>> Red Hat S.L.
>> Continuous Integration Engineer - EMEA ENG Virtualization R&D
>>
>> Tel.: +420 532 294 605
>> Email: [email protected]
>> IRC: dcaro|dcaroest@{freenode|oftc|redhat}
>> Web: www.redhat.com
>> RHT Global #: 82-62605
>>
>> _______________________________________________
>> Devel mailing list
>> [email protected]
>> http://lists.ovirt.org/mailman/listinfo/devel
>
> --
> Eyal Edri
> Associate Manager
> EMEA ENG Virtualization R&D
> Red Hat Israel
>
> phone: +972-9-7692018
> irc: eedri (on #tlv #rhev-dev #rhev-integ)

--
Eyal Edri
Associate Manager
EMEA ENG Virtualization R&D
Red Hat Israel

phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)
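The "process keeping it busy" remount failure described in the thread
above is the kind of thing usually diagnosed on the node itself with
`fuser -vm <mountpoint>` or `lsof +f -- <mountpoint>`. As a rough
illustration of what those tools report, here is a minimal Python
sketch (the function name is made up and Linux-only /proc scanning is
assumed; the thread does not give the actual DRBD mount path) that
finds processes holding files open under a given path:

```python
import os

def pids_with_open_files(path):
    """Rough stand-in for `fuser -m`: walk /proc/<pid>/fd and collect
    the PIDs of processes holding a file open under `path`."""
    path = os.path.abspath(path)
    busy = set()
    for pid in os.listdir("/proc"):
        if not pid.isdigit():       # skip non-process entries
            continue
        fd_dir = os.path.join("/proc", pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:             # process exited or access denied
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:         # fd closed while we were scanning
                continue
            if target == path or target.startswith(path + os.sep):
                busy.add(int(pid))
                break
    return busy
```

On a real node one would run this (or `fuser`/`lsof` directly) against
the DRBD mount point before retrying the mount, to decide which
processes need to be stopped first.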
_______________________________________________
Infra mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/infra
