On Wednesday 15 August 2018 at 14:50 +0530, Sankarshan Mukhopadhyay wrote:
> Thank you for (a) addressing the issue and (b) this write-up.
>
> Does the -infra team have a way to monitor disk space usage?

Munin:
http://munin.gluster.org/munin/rht.gluster.org/jenkins-el7.rht.gluster.org/index.html#disk

It seems we did have notifications, but they were turned off (by me) in
May 2016 with a laconic "receiving too much of them for now". I guess it
was sending too many false positives and we didn't spend time fixing
that.

I have wanted to move it out of Rackspace for a long time, since it
can't monitor the internal network from there, and also to move to
Nagios for alerting (since it lets you filter alerts).

> On Wed, Aug 15, 2018 at 2:40 PM Michael Scherer <[email protected]> wrote:
> >
> > Hi folks,
> >
> > So the Gluster Jenkins disk was full today (because outages do not
> > respect public holidays in India (Independence Day) and France
> > (Assumption)). Here is the post mortem for your reading pleasure.
> >
> > Date: 15/08/2018
> >
> > Service affected:
> > Jenkins for Gluster (jenkins-el7.rht.gluster.org)
> >
> > Impact:
> >
> > No Jenkins job could be triggered.
> >
> > Root cause:
> >
> > The disk filled up, mainly because we got new jobs and more patches,
> > so regular growth.
> >
> > Resolution:
> >
> > Increased the disk by 30G, and we are investigating whether cleanup
> > could be improved. This did require a reboot.
> >
> >
> > Involved people:
> > - misc
> > - nigel
> >
> > Lessons learned
> > - What went well:
> >   - we had a documented process for that, good enough to be used by
> >     a tired admin.
> >
> > - What went bad:
> >   - we weren't proactive enough to see that before it caused an
> >     outage
> >   - 15 August is a holiday in both France and India. Technically,
> >     none of the infra team should have been up.
> >
> > - Where we were lucky:
> >   - It was a day off in India, so few people were affected, except
> >     folks who continue to work on days off
> >   - Misc decided to work while in Brno, to take the days off later
> >
> >
> > Timeline (in UTC)
> >
> > - 05:58 Amar posts a mail to gluster-infra to say "smoke job fail":
> >   https://lists.gluster.org/pipermail/gluster-infra/2018-August/004795.html
> >
> > - 06:23 Nigel pings Misc on Telegram to deal with it, since Nigel is
> >   away from his laptop for the Independence Day celebration.
> >
> > - 06:24 Misc does not hear the ding since he is asleep.
> >
> > - 06:55 Sankarshan opens a bug on it:
> >   https://bugzilla.redhat.com/show_bug.cgi?id=1616160
> >
> > - 06:56 Misc does not see the email since he is still asleep.
> >
> > - 07:13 Misc wakes up, sees a blinking light on the phone and ponders
> >   closing his eyes again. He looks at it, and starts to swear.
> >
> > - 07:14 Investigation reveals that the Jenkins partition is full
> >   (100%). A quick investigation does not yield any particular issue.
> >   The Jenkins jobs are taking space and that's it.
> >
> > - 07:19 After discussion with Nigel, it is decided to increase the
> >   size of the partition. Misc takes a look at it and tries to
> >   increase it, without any luck. The server is rebooted in case
> >   that's what was needed. Still not enough.
> >
> > - 07:25 Misc goes for a quick shower to wake himself up. The warm
> >   embrace of water makes him remember that documentation on that
> >   process does exist:
> >
> >   https://gluster-infra-docs.readthedocs.io/procedures/resize_vm_partition.html
> >
> > - 07:30 Following the documentation, we discover that the hypervisor
> >   is now out of space for future increases. Looking at that will be
> >   done after the post mortem.
> >
> > - 07:37 Jenkins is being restarted, with more space, and seems to
> >   work OK.
> >
> > - 07:38 Misc rushes to his hotel breakfast, which closes at 10.
> >
> > - 09:09 Post mortem is finished and being sent.
> >
> >
> > Action items:
> > - (misc) see what can be done for myrmicinae (the hypervisor where
> >   Jenkins is running) since there is no more space.
> >
> > Potential improvements to make:
> > - we still need to have monitoring in place
> > - we need to move Munin into the internal LAN to look at the graphs
> >   for Jenkins
> > - documentation regarding resizing could be clearer, notably on the
> >   volume resizing part
> >
> >
> > --
> > Michael Scherer
> > Sysadmin, Community Infrastructure and Platform, OSAS
> >
> > _______________________________________________
> > Gluster-infra mailing list
> > [email protected]
> > https://lists.gluster.org/mailman/listinfo/gluster-infra
> >
>
-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS
