Hi all,

We're experiencing huge nodepool slowness under load. Nodes sit in the delete state for a long time (sometimes up to 20 minutes) before they actually get removed (we see very similar behaviour for node creation too), which exhausts our resources very quickly and slows our throughput to the speed of a snail with heavy shopping.
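For anyone wanting to reproduce this kind of measurement from their own logs, here's a rough sketch of timing how long each node sits in the delete state. The log line format (and the "delete"/"removed" state names) below are illustrative assumptions, not nodepool's actual debug format, so the regex will need adjusting:

```python
# Sketch: measure how long each node spends in the 'delete' state.
# The log format matched here is hypothetical -- adapt LINE_RE to
# whatever your nodepool debug log actually emits.
import re
from datetime import datetime

LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*"
    r"node (?P<node>\d+).*state: (?P<state>\w+)"
)

def delete_durations(lines):
    """Return {node_id: seconds between entering 'delete' and removal}."""
    entered = {}    # node id -> datetime it entered the delete state
    durations = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        node, state = m.group("node"), m.group("state")
        if state == "delete":
            # Remember the first time we saw the node go into delete.
            entered.setdefault(node, ts)
        elif state == "removed" and node in entered:
            durations[node] = (ts - entered.pop(node)).total_seconds()
    return durations
```

Feed it the log file line by line and it spits out per-node durations, which is enough to plot the kind of graphs below.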
To try and figure out why, I wrote a little log analysis tool, and here are some graphs from the data.

Individual task time taken:
https://s3.amazonaws.com/uploads.hipchat.com/8522/961402/4H008OHlrWf4NLm/task-time.png
This shows the time taken in seconds by each nodepool task (e.g. AddFloatingIPTask). Yes, it's slow, but it's consistent: during high load the tasks only get more densely packed, they don't get slower.

Nodepool task queue size:
https://s3.amazonaws.com/uploads.hipchat.com/8522/961402/1S0kAiKGMMQCrpb/queue-size.png
This shows the number of individual nodepool tasks (e.g. AddFloatingIPTask) waiting in the queue. Guess when a load of jobs hit us!

Total node deletion time:
https://s3.amazonaws.com/uploads.hipchat.com/8522/961402/ixQxq4U4C5icl2K/deletion-time.png
This shows how long nodes spend in the delete state: from the transition from used to delete, until all the delete tasks have run and the node is removed. Take a look at what happens when there's a lot of stuff in the queue. Ouchy.

Our 'rate' is the default of 1.0.

Any ideas or help would be appreciated!

Thanks,
Mike

_______________________________________________
OpenStack-Infra mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra
