Hi folks,

We had another scale test today to analyse why the controller CPU usage didn't fall away as expected when the models were removed.

I'll be filing a bunch of bugs from the analysis process, but there is one bug that is, I believe, the culprit for the high CPU usage.

Interestingly enough, Juju developers were not able to reproduce the problem with smaller deployments. The scale that we were testing was 140 models each with 10 machines and about 20 total units.

During the teardown process of the testing, all models were destroyed at once.

Most of Juju's responsiveness is driven off the watchers. These watchers watch the mongo oplog for document changes. What happened was that there were so many mongo operations that the oplog, which is a capped collection, was completely replaced between our watcher polls. The watchers then errored out in a new, unexpected way.

Effectively, the watcher infrastructure needs an internal reset button that it can hit when this happens, one that invalidates all the watchers. This should cause all the workers to be torn down and restarted from a known good state.

One model got stuck being destroyed. This was traced back to the worker that should have been doing the destruction not noticing.

All the CPU usage can be traced back to the 139 models in the apiserver state pools, each still running leadership and base watcher workers. The state pool should have removed all these instances, but it didn't notice they were gone.

There are some other bugs around logging things as errors that really aren't errors, which contributed to log noise, but the fundamental error here is not being robust in the face of too much change at once.

This needs to be fixed for the 2.2 release candidate, so it may well push that out past the end of this week.


Juju-dev mailing list