Hey all,

Quick note about Friday night's gate outage so that we can post-mortem it later.

Best we can tell - there was a network incident at HP where things went horribly wrong. During that period, our hypothesis is that we interpreted failure responses from our slaves as "slave is gone, delete from db" when the slave was in fact still there. That then led to overrunning our quota, because there were slaves that needed deleting but that we'd stopped knowing about. We do not have proof of this - it's a hypothesis. We (and by we I mean fungi) manually deleted all of the slaves.

The problem was noticed by "lots of lost jobs showing up on status page". That makes me think that perhaps that's a metric that would be useful to track. Perhaps "number of lost jobs" and "number of jobs", so that 'lost-jobs-per-X' could be a thing we care about, but also '%-jobs-lost-per-X'.

THEN - once things started coming back up, they were unable to properly connect to Jenkins. Again, the hypothesis being that the rampant slave failures put Jenkins into a bad state. We restarted it.

After deleting all of the slaves and restarting Jenkins, all appears to be good now.

Monty

_______________________________________________
OpenStack-Infra mailing list
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra
