On Sunday, 3 September 2017 12:47:34 PM AEST Russell Coker wrote:
> The luv server was down this morning because of a KVM error.  Also another
> KVM VM on the same system crashed.  Sorry for sleeping in.

It turned out to be BTRFS mis-managing free space, deciding there was none 
left, and going into read-only mode.  The QEMU/KVM server blocked on disk IO 
and paused the virtual machines, which meant that they couldn't even respond 
to pings.

I've set up a cron job to run a weekly balance on the BTRFS filesystem, which 
will prevent this from happening again.  I've seen similar things in the past 
but didn't expect them in this case because the filesystem is only 50% full.
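
For anyone wanting to do something similar, here is a rough sketch of what a 
weekly balance job could call.  It is not the job actually running on the 
server.  The function names and the 50% usage filter are assumptions made for 
illustration; the filter limits the balance to chunks that are at most that 
full, so the whole filesystem isn't rewritten every week.

```python
import subprocess


def balance_command(mount_point, usage_percent=50):
    """Build a btrfs balance command restricted to partly used chunks.

    -dusage / -musage limit the balance to data / metadata chunks that are
    at most usage_percent full. The default of 50 is an arbitrary choice.
    """
    if not 0 <= usage_percent <= 100:
        raise ValueError("usage_percent must be between 0 and 100")
    return [
        "btrfs", "balance", "start",
        f"-dusage={usage_percent}",
        f"-musage={usage_percent}",
        mount_point,
    ]


def run_balance(mount_point, usage_percent=50, runner=subprocess.run):
    """Run the balance and return the exit status of the btrfs command.

    runner is injectable so the function can be exercised without root
    access or a real BTRFS filesystem.
    """
    result = runner(balance_command(mount_point, usage_percent))
    return result.returncode
```

A small script calling run_balance("/") from a weekly cron entry as root would 
be one way to use it.  A non-zero return value is worth mailing to the admin, 
as a failed balance can be an early sign of the same free space problem.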

I also got an alert about problems before going to sleep last night, but it 
didn't look important (it looked like just a "certificate is going to expire 
in 2 weeks" warning, not "can't even talk to SSL server").  I've rewritten 
the monitor script in question to give more useful information so I won't 
make that mistake in future.
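
As an illustration of the kind of distinction the alert needs to make, here 
is a minimal sketch of a check that reports "can't connect" separately from 
"expires soon".  It is not the actual monitor script.  The function names, 
message wording and 14 day warning threshold are all assumptions.

```python
import socket
import ssl
import time


def describe_tls_state(connect_error, seconds_left, warn_days=14):
    """Turn a check result into an alert line that says how serious it is."""
    if connect_error is not None:
        return f"CRITICAL: cannot talk to SSL server: {connect_error}"
    days_left = seconds_left / 86400
    if days_left < 0:
        # With default verification an expired certificate normally fails
        # the handshake and is reported by the branch above instead.
        return f"CRITICAL: certificate expired {-days_left:.0f} days ago"
    if days_left <= warn_days:
        return f"WARNING: certificate expires in {days_left:.0f} days"
    return f"OK: certificate valid for {days_left:.0f} more days"


def check_tls(host, port=443, timeout=10.0, warn_days=14):
    """Connect to host, read the certificate expiry, describe the outcome."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except OSError as exc:  # also covers ssl.SSLError and timeouts
        return describe_tls_state(exc, None, warn_days)
    seconds_left = ssl.cert_time_to_seconds(not_after) - time.time()
    return describe_tls_state(None, seconds_left, warn_days)
```

Putting the severity at the start of the message means a connection failure 
can't be mistaken for a routine expiry reminder when skimming alerts late at 
night.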

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
_______________________________________________
luv-main mailing list
[email protected]
https://lists.luv.asn.au/cgi-bin/mailman/listinfo/luv-main
