Re: Wiki and Mailing Lists Outage -- 2012-11-14

Eyal Edri Wed, 14 Nov 2012 07:18:01 -0800

We are running a disk space job on the Jenkins slaves:
http://jenkins.ovirt.org/view/system-monitoring/job/check_disk_space_on_jenkins_slaves


It runs a script [1]; I guess we can clone it to check the other infra
servers as well.


[1]
#!/bin/sh
# -P keeps each filesystem on a single line, even with long device names
df -HP | grep -vE '^Filesystem|tmpfs|cdrom|file.tlv|loop' | awk '{ print $5 " " $1 }' | while read -r output;
do
  echo "$output"
  # strip the trailing % so the value can be compared numerically
  usep=$(echo "$output" | awk '{ print $1 }' | cut -d'%' -f1)
  partition=$(echo "$output" | awk '{ print $2 }')
  if [ "$usep" -ge 90 ]; then
    echo "Running out of space \"$partition ($usep%)\" on $(hostname) as on $(date)"
    exit 1
  fi
done
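
If we do clone it for the other infra servers, something like the sketch below might be easier to reuse from cron. It is only an illustration, not what the Jenkins job runs today: the names THRESHOLD, check_usage and report_disk_usage are made up here, and it assumes a POSIX sh and a df that supports -P.

```sh
#!/bin/sh
# Sketch only: THRESHOLD, check_usage and report_disk_usage are
# illustrative names, not part of the existing Jenkins job.
THRESHOLD="${THRESHOLD:-90}"

# Reads "<use%> <filesystem>" lines on stdin, prints a warning for every
# filesystem at or above THRESHOLD, and returns 1 if any were found.
check_usage() {
  status=0
  while read -r usep partition; do
    usep="${usep%\%}"                              # drop the trailing %
    case "$usep" in ''|*[!0-9]*) continue ;; esac  # skip non-numeric rows
    if [ "$usep" -ge "$THRESHOLD" ]; then
      echo "Running out of space \"$partition ($usep%)\" on $(hostname) as on $(date)"
      status=1
    fi
  done
  return "$status"
}

# Emits "<use%> <filesystem>" for each real filesystem; df -P keeps one
# line per filesystem and NR > 1 skips the header.
report_disk_usage() {
  df -P | awk 'NR > 1 { print $5 " " $1 }' | grep -vE 'tmpfs|cdrom|loop'
}

# Typical call from cron or a Jenkins job:
#   report_disk_usage | check_usage || exit 1
```

Unlike the original, this reports every full filesystem before failing instead of stopping at the first one, and the limit can be changed per host by setting THRESHOLD in the environment. Run from cron with MAILTO set, any output would be mailed, so it only makes noise when a filesystem crosses the limit.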

----- Original Message -----
> From: "Mike Burns" <[email protected]>
> To: "Doron Fediuck" <[email protected]>
> Cc: "infra" <[email protected]>, "users" <[email protected]>, "board" 
> <[email protected]>
> Sent: Wednesday, November 14, 2012 3:58:17 PM
> Subject: Re: Wiki and Mailing Lists Outage -- 2012-11-14
> 
> On Wed, 2012-11-14 at 08:45 -0500, Doron Fediuck wrote:
> > Thanks Mike!
> > I suggest having a cron job that alerts on no-space issues.
> 
> We run logwatch which is supposed to highlight these issues, but I
> suspect that no one is actually reading the logwatch report.  A
> separate cron job or monitoring service is also a possibility.
> 
> Mike
> > 
> > ----- Original Message -----
> > > From: "Mike Burns" <[email protected]>
> > > To: "board" <[email protected]>, "infra" <[email protected]>, "users"
> > > <[email protected]>
> > > Sent: Wednesday, November 14, 2012 3:31:11 PM
> > > Subject: Wiki and Mailing Lists Outage -- 2012-11-14
> > > 
> > > We experienced an outage today in both the wiki and the mailing lists.
> > > 
> > > * Wiki content was available throughout the outage, but attempts to
> > >   log in or edit received an error message about requiring cookies to
> > >   be enabled.
> > > * All mails to the mailing lists failed to show up on the lists, but
> > >   also did not return rejection messages.
> > > 
> > > Cause:
> > > 
> > > This was caused by an "Out of Space" error on the host running both of
> > > these services.  A temporary workaround was put in place to get both
> > > services up and running again.
> > > 
> > > 
> > > Action Taken:
> > > 
> > > Remove the oldest gerrit backup (600MB)
> > > Remove some older non-functional ovirt-node-iso images and rpms from
> > > the releases (source remains there)
> > > 
> > > Long term solution:
> > > 
> > > Migrating these services away from a single host onto hosted solutions
> > > (OpenShift, AlterWay).
> > > 
> > > Current Situation:
> > > 
> > > Wiki is back up and running, login works as expected
> > > Lists are processing the backlog of emails since the outage began.
> > > At this time, it does not appear that any mail was lost due to the
> > > outage.
> > > 
> > > 
> > > Thanks for the patience and understanding
> > > 
> > > Mike
> > > 
> > > _______________________________________________
> > > Infra mailing list
> > > [email protected]
> > > http://lists.ovirt.org/mailman/listinfo/infra
> > > 
> > _______________________________________________
> > Board mailing list
> > [email protected]
> > http://lists.ovirt.org/mailman/listinfo/board
> 
_______________________________________________
Infra mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/infra
