Excellent post-mortem! Do you think it's worth adding mirrors for the gluster repos, as oVirt is doing? [1]

[1] http://ovirt-infra-docs.readthedocs.org/en/latest/General/Mirror.html

On Wed, Apr 27, 2016 at 1:56 PM, Michael Scherer <msche...@redhat.com> wrote:

> Hi,
>
> as promised, here is the post-mortem of the incident. If you would like
> to see more information, or have any remarks, please do not hesitate, since
> this is our first attempt at one.
>
> I modelled it on the example in
> http://shop.oreilly.com/product/0636920041528.do, as that is the book I am
> reading at the moment (Appendix D). We will formalize that later.
>
>
>
> Download.gluster.org was not serving files
> Date: 2016-04-27
> Participating people:
> - misc
>
> Summary:
>
> The Download.gluster.org http server was returning error 403 for all URLs,
> which impacted ovirt jenkins jobs and users using the repository,
> among others. The server is used to distribute gluster rpms.
>
> Impact:
> - ovirt CI jobs got blocked
> - users couldn't install gluster
>
> Root cause:
> the underlying block device on rackspace was down for an undiagnosed
> reason, triggering xfs errors on the server and thus 403 at the http
> level.
>
> the root cause of the block device error is still unknown; no errors
> have been seen on the rackspace status page for this DC. A ticket was
> opened with rackspace to see what was going on (160427-iad-0000814); a
> follow-up to this post-mortem will be done if the ticket says something
> more than "shit happens".
>
> Resolution:
>
> The whole server was rebooted, and upon reboot, the block device came
> back.
>
> Lessons learned:
> - what went well:
>   - people notified the admin quickly on irc and on gluster-infra
>
> - where we were lucky
>   - the server and block device came back immediately
>   - it failed during EMEA business hours, with misc being on irc (just
>     arrived at the office)
>
>
> - what went badly
>   - we do not have proper HA for the service
>   - we do not have automated monitoring for it
>   - the setup is using 2 block devices of 120G in lvm, thus making it
>     twice as likely to fail
>
> Timeline (in UTC)
> - 05:39 first error message in the log about an XFS error
> - 08:41 misc is pinged on irc
> - 08:56 misc acks and diagnoses the issue
> - 09:00 the server and service are back to normal
> - 09:00 first mail about the problem hits gluster-infra
>
>
> Potential improvements to make:
> - add monitoring on gluster side
> - use the centos sig repo on ovirt side
> - add more sysadmins for gluster
> - add a redundant service for that
> - a 2nd download server with a shared gluster backend
> - migrate the storage to a proper setup with 1 single block device,
>   rather than 2.
>
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>
>
> _______________________________________________
> Infra mailing list
> Infra@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/infra
>
>

--
Eyal Edri
Associate Manager
RHEV DevOps
EMEA ENG Virtualization R&D
Red Hat Israel

phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)