We are running a disk space job on the Jenkins slaves:
http://jenkins.ovirt.org/view/system-monitoring/job/check_disk_space_on_jenkins_slaves
It runs a script [1]. I guess we can clone this to check other infra servers as well.

[1]
#!/bin/sh
# Print "use% filesystem" for every real filesystem and fail at 90% or more.
df -H | grep -vE '^Filesystem|tmpfs|cdrom|file.tlv|loop' | awk '{ print $5 " " $1 }' | while read output;
do
  echo "$output"
  # Split the line into the numeric usage and the filesystem name.
  usep=$(echo "$output" | awk '{ print $1 }' | cut -d'%' -f1)
  partition=$(echo "$output" | awk '{ print $2 }')
  if [ "$usep" -ge 90 ]; then
    echo "Running out of space \"$partition ($usep%)\" on $(hostname) as on $(date)"
    exit 1
  fi
done

----- Original Message -----
> From: "Mike Burns" <mbu...@redhat.com>
> To: "Doron Fediuck" <dfedi...@redhat.com>
> Cc: "infra" <infra@ovirt.org>, "users" <us...@ovirt.org>, "board" <bo...@ovirt.org>
> Sent: Wednesday, November 14, 2012 3:58:17 PM
> Subject: Re: Wiki and Mailing Lists Outage -- 2012-11-14
>
> On Wed, 2012-11-14 at 08:45 -0500, Doron Fediuck wrote:
> > Thanks Mike!
> > I suggest to have a cron alerting for no-space issues.
>
> We run logwatch which is supposed to highlight these issues, but I
> suspect that no one is actually reading the logwatch report. A separate
> cron job or monitoring service is also a possibility.
>
> Mike
> >
> > ----- Original Message -----
> > > From: "Mike Burns" <mbu...@redhat.com>
> > > To: "board" <bo...@ovirt.org>, "infra" <infra@ovirt.org>, "users" <us...@ovirt.org>
> > > Sent: Wednesday, November 14, 2012 3:31:11 PM
> > > Subject: Wiki and Mailing Lists Outage -- 2012-11-14
> > >
> > > We experienced an outage today in both the wiki and the mailing lists.
> > >
> > > * Wiki content was available throughout the outage, but attempts to
> > >   login or edit received an error message about requiring cookies to be
> > >   enabled.
> > > * All mails to the mailing list failed to show up on the lists, but
> > >   also did not return rejection messages.
> > >
> > > Cause:
> > >
> > > This was caused by an "Out of Space" error on the host running both of
> > > these services. A temporary workaround was put in place to get both
> > > services up and running again.
> > >
> > >
> > > Action Taken:
> > >
> > > Remove the oldest gerrit backup (600MB)
> > > Remove some older non-functional ovirt-node-iso images and rpms from the
> > > releases (source remains there)
> > >
> > > Long term solution:
> > >
> > > Migrating these services away from a single host onto hosted solutions
> > > (OpenShift, AlterWay).
> > >
> > > Current Situation:
> > >
> > > Wiki is back up and running, login works as expected
> > > Lists are processing the backlog of emails since the outage began.
> > > At this time, it does not appear that any mail was lost due to the
> > > outage.
> > >
> > >
> > > Thanks for the patience and understanding
> > >
> > > Mike
> > >
> > > _______________________________________________
> > > Infra mailing list
> > > Infra@ovirt.org
> > > http://lists.ovirt.org/mailman/listinfo/infra
> > >
> > _______________________________________________
> > Board mailing list
> > bo...@ovirt.org
> > http://lists.ovirt.org/mailman/listinfo/board
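
On the idea of cloning script [1] for other infra servers: below is a rough sketch of how the check might be parameterized, not the script the Jenkins job actually runs. The function name check_usage, the use of "df -P" (POSIX one-line-per-filesystem output) instead of "df -H", and "uname -n" in place of "hostname" are my assumptions; the site-specific "file.tlv" filter is left out.

```shell
#!/bin/sh
# Hypothetical variant of script [1]: the threshold is an argument and the
# df output is read from stdin, so the check can be reused and tested.

# check_usage THRESHOLD
# Reads "df -P"-style lines on stdin, prints a warning for every filesystem
# at or above THRESHOLD percent, and returns 1 if any was found, else 0.
check_usage() {
    awk -v limit="$1" -v host="$(uname -n)" '
        NR == 1 { next }                  # skip the header line
        /tmpfs|cdrom|loop/ { next }       # skip pseudo and removable filesystems
        {
            use = $5
            sub(/%$/, "", use)            # strip the trailing percent sign
            if (use + 0 >= limit + 0) {
                printf "Running out of space \"%s (%s%%)\" on %s\n", $1, use, host
                full = 1
            }
        }
        END { exit (full ? 1 : 0) }
    '
}

# Example invocation (90 matches the limit used in [1]):
#   df -P | check_usage 90 || echo "disk check failed"
```

Unlike [1], this reports every filesystem over the limit before failing rather than stopping at the first one. Run from cron or a Jenkins job on each server, the non-zero exit status is what would trigger the alert.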