On 5 December 2013 18:52, Mark Shuttleworth <[email protected]> wrote:
> Would it help if Juju could maintain an awareness of the disk situation
> and gracefully avoid making the problem worse (and avoid corruption) by
> going read-only when disk is low?

Running out of disk is quite an insidious position to be in. I'd say it is first important not to cause it, and then not to get corrupted if it does happen.

There were two things that caused it. The first was that all of the machines were sending logging information to all-machines.log, and they were all attempting something every second or so and failing, which led to a rapid log build-up. Given that logging information is useful, I don't see any option to silence it before running out of disk, because doing so might make it impossible to diagnose a problem. The core issue was probably that all of the agents were rapidly retrying something and all failing. Some form of exponential backoff would have helped there.

Once we had suppressed that problem, the mongo database was in a strange condition which caused it to eat disk space at 75 MB/sec. This persisted even after restarts of juju-db. I don't know if the root cause of this is understood. It did seem that the resulting files were mostly null bytes, which is probably telling. We let it eat 14 GB on another disk before stopping it and doing a mongodump/restore back to a ~300 MB database. That fixed the rapid growth but revealed that there was something wrong with the transactions, presumably because one had been written incorrectly when the disk ran out.

After this Roger encountered many small issues, which I presume were remnants of the age of our environment and of our abuse of it because we didn't understand juju's model (and/or juju has changed in lots of little ways since the system was created).

That's my brief understanding of the situation.

All the best,
- Peter
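
To illustrate the exponential backoff mentioned above, here is a minimal sketch in Python. It is not juju code. The names backoff_delays and retry and all of the default values are hypothetical, chosen only for the example. The idea is that each failed attempt waits longer than the last, up to a cap, with a little random jitter so that many agents do not retry in lockstep.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, cap=300.0, jitter=0.1, rng=random.random):
    """Yield an endless series of retry delays in seconds (hypothetical helper)."""
    delay = base
    while True:
        # Scale the delay by a random factor in [1 - jitter, 1 + jitter]
        # so that many agents failing together do not retry together.
        spread = 1.0 + jitter * (2.0 * rng() - 1.0)
        yield min(delay, cap) * spread
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * factor, cap)


def retry(operation, max_attempts=8, sleep=time.sleep, **backoff_args):
    """Call operation() until it succeeds or max_attempts have been made."""
    delays = backoff_delays(**backoff_args)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            # Give up and re-raise once the attempt budget is spent.
            if attempt == max_attempts:
                raise
            sleep(next(delays))
```

With these assumed defaults, an agent that keeps failing waits roughly 1, 2, 4, 8, ... seconds between attempts instead of retrying (and logging) every second, so the log volume from a persistent failure falls off quickly rather than growing linearly.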
