Hey guys, I did some scale testing back in Cape Town, and I realized I never sent around the results. Here's a brain dump of what I remember:

1) juju deploy -n 15 fails to spawn a bunch of the machines with "Rate Limit Exceeded". I filed a bug for it. The instance poller showed up the most in the log files. A good improvement there would be to change our polling to handle all machines in a batch (probably with a fast batch and a slow batch?). A bigger fix: if provisioning fails because of rate limiting, we should just try again later instead of marking the machine as failed.

2) After 1-2k units, restarting the machine-0 agent could cause it to deadlock. From what I could tell, we were filling the request socket to mongo and it just wasn't returning responses, even to the point of hitting a 5-minute timeout trying to write to the socket. I added a "max 10 concurrent Login requests" semaphore; I could see it activating and telling agents to come back later, and I never deadlocked it again. This needs more testing, but I think it is worth adding. (10/100/whatever; the key seems to be avoiding having 10,000 units all trigger the same request at the same time.)

3) Things get weird if you try to deploy more unit agents than you have RAM for. I could only fit about 800-900 agents in 7.5GB of RAM (and these machines don't have swap). I doubt this is ever an issue in practice, but it was odd to figure out.

4) We still need to land the GOMAXPROCS code for the API servers.

5) Even with that, I don't think I ever saw more than 300% load on the API server, and never more than 80% for mongo. I think mongo is often a bottleneck, but I don't know how to make it use more than one CPU. Maybe because we use only one connection?

6) We still have the CharmURL bug, but we knew that.

7) I think getting some charms that have basic relations, and having them trigger at scale, could be really useful.

John
=:->
--
Juju-dev mailing list
[email protected]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
