Hi folks,

So the Gluster Jenkins disk was full today (because outages do not respect
public holidays in India (Independence Day) and France (Assumption)).
Here is the post mortem for your reading pleasure.

Date: 15/08/2018

Service affected:
  Jenkins for Gluster (jenkins-el7.rht.gluster.org)

Impact:

  No jenkins job could be triggered.

Root cause:

  The disk filled up, mainly because we got new jobs and more patches.
  In other words, regular growth.

Resolution:

  Increased the disk by 30G, and we are investigating whether cleanup
  could be improved. This did require a reboot.
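
  As a starting point for the cleanup investigation, here is a rough
  sketch (not what we actually ran) that lists which subdirectories
  take the most space. The function names and the path in the usage
  note are made up for illustration; adjust to wherever the Jenkins
  job data really lives.

```python
import os


def dir_size(path):
    """Sum the size in bytes of all files below path (symlinks not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                total += os.lstat(full).st_size
            except OSError:
                # File vanished or is unreadable (jobs may be running); skip it.
                pass
    return total


def largest_subdirs(parent, top=10):
    """Return up to `top` (name, size_in_bytes) pairs, biggest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.name, dir_size(entry.path)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]
```

  Calling largest_subdirs("/var/lib/jenkins/jobs") would be one way to
  use it, assuming that is where the jobs are stored on the server.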


Involved people:
- misc
- nigel

Lessons learned
- What went well:
  - we had a documented process for that, and it was good enough to be
    used by a tired admin.

- What went badly:
  - we weren't proactive enough to see this before it caused an outage
  - 15 August is a holiday in both France and India. Technically,
    none of the infra team should have been up.

- Where we were lucky:
  - It was a day off in India, so few people were affected, except
    folks who continue to work on days off
  - Misc decided to work that day while in Brno, to take days off
    later


Timeline (in UTC)

- 05:58 Amar posts a mail to say "smoke job fail" on gluster-infra:
https://lists.gluster.org/pipermail/gluster-infra/2018-August/004795.html

- 06:23 Nigel pings Misc on Telegram to deal with it, since Nigel is
away from his laptop for Independence Day celebrations.

- 06:24 Misc does not hear the ding since he is asleep

- 06:55 Sankarshan opens a bug on it:
https://bugzilla.redhat.com/show_bug.cgi?id=1616160

- 06:56 Misc does not see the email since he is still asleep

- 07:13 Misc wakes up, sees a blinking light on the phone and ponders
closing his eyes again. He looks at it, and starts to swear.

- 07:14 Investigation reveals that the Jenkins partition is full (100%).
A quick look does not turn up any particular issue. The Jenkins
jobs are taking space and that's it.

- 07:19 After discussion with Nigel, it is decided to increase the size
of the partition. Misc takes a look at it and tries to increase it,
without any luck. The server is rebooted in case that's what was
needed. Still not enough.

- 07:25 Misc goes for a quick shower to wake himself up. The warm
embrace of water makes him remember that documentation on that process
does exist:

https://gluster-infra-docs.readthedocs.io/procedures/resize_vm_partition.html

- 07:30 Following the documentation, we discover that the hypervisor
is now out of space for future increases. Looking at that will be done
after the post mortem.

- 07:37 Jenkins is being restarted, with more space, and seems to work
ok.

- 07:38 Misc rushes to his hotel breakfast, which closes at 10.

- 09:09 Post mortem is finished and being sent


Action items:
- (misc) see what can be done for myrmicinae (the hypervisor where
jenkins is running) since there is no more space.

Potential improvements to make:
- we still need to have monitoring in place
- we need to move munin to the internal LAN to look at the graphs
for jenkins
- the documentation on resizing could be clearer, notably the volume
resizing part
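
On the monitoring point, the kind of check that would have warned us
before the partition hit 100% can be sketched in a few lines. This is
only an illustration, not something we have deployed: the function
names and the 80% threshold are assumptions, and a real setup would
plug into whatever monitoring we end up using. Note that the
percentage is computed from shutil.disk_usage, so it can differ
slightly from what df reports.

```python
import shutil
import sys


def usage_percent(total, used):
    """Return used space as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def check_disk(path, warn_at=80.0):
    """Return (alert, percent_used) for the filesystem holding path.

    warn_at is a hypothetical threshold; pick whatever leaves enough
    time to react before the disk is full.
    """
    usage = shutil.disk_usage(path)
    pct = usage_percent(usage.total, usage.used)
    return pct >= warn_at, pct


if __name__ == "__main__":
    # Path to check is taken from the command line, "/" by default.
    target = sys.argv[1] if len(sys.argv) > 1 else "/"
    alert, pct = check_disk(target)
    print(f"{target}: {pct:.1f}% used")
    # Non-zero exit status so cron or a monitoring wrapper can notice.
    sys.exit(1 if alert else 0)
```

Run from cron against the Jenkins partition, a non-zero exit status
would be the signal to go look before jobs start failing.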


-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS


_______________________________________________
Gluster-infra mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-infra
