Bug 1338841 [1] started showing up yesterday. I first noticed it on the change that sets osapi_volume_workers to the number of available CPUs by default. Similar patches for trove (api/conductor workers) and glance (api/registry workers) have also landed in the last week, and nova has been running with multiple api/conductor workers by default since Icehouse.

It looks like the cinder change pushed us past the default postgresql max_connections limit, and we started getting asynchronous connection failures in that job. [2]

We can also note that the postgresql job is the only one that runs the nova api-metadata service, which has its own workers.

The VMs the jobs are running on have 8 VCPUs, so that's at least 88 workers between nova (3), cinder (1), glance (2), trove (2), neutron, heat and ceilometer.

So osapi_volume_workers (8) + n-api-meta workers (8) seems to have tipped it over.
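
To make the arithmetic above concrete, here is a rough sketch of the worker tally. The per-project service counts for nova, cinder, glance and trove are the ones listed above; counting one worker pool each for neutron, heat and ceilometer is my assumption to reach 88, and the limit of 100 assumes the stock postgresql default for max_connections. It also assumes one DB connection per worker, which is probably optimistic.

```python
# Rough worker tally for the postgresql job (a sketch, not measured data).
VCPUS = 8  # VCPUs on the job VMs

# Worker-spawning services per project. The nova/cinder/glance/trove counts
# are from the email; one each for neutron, heat and ceilometer is assumed.
SERVICES_PER_PROJECT = {
    "nova": 3,
    "cinder": 1,
    "glance": 2,
    "trove": 2,
    "neutron": 1,     # assumption
    "heat": 1,        # assumption
    "ceilometer": 1,  # assumption
}

# Assumed stock postgresql default; a given install may differ.
DEFAULT_MAX_CONNECTIONS = 100


def estimate_workers(services, vcpus):
    """Total workers if every service spawns one worker per VCPU."""
    return sum(services.values()) * vcpus


total_workers = estimate_workers(SERVICES_PER_PROJECT, VCPUS)
headroom = DEFAULT_MAX_CONNECTIONS - total_workers

print(f"estimated workers: {total_workers}")
print(f"headroom under max_connections={DEFAULT_MAX_CONNECTIONS}: {headroom}")
```

If each worker can hold more than one connection at a time, the real demand is higher than this estimate, so the small headroom here is easy to exhaust.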

The first attempt at a fix is to simply double the default max_connections value [3].

While looking up the postgresql configuration docs, I also read a bit about synchronous_commit=off and fsync=off. It sounds like we might want to consider using one of those in devstack runs, since they are supposed to perform better if you don't care about disaster recovery (which we don't in gate runs on VMs).
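
As an illustration of the settings being discussed, here is a minimal sketch that renders them as postgresql.conf lines. The helper name render_overrides is hypothetical, and the value 200 assumes that doubling starts from a stock default of 100; the actual number is whatever the review in [3] uses. Only synchronous_commit is shown as turned off. As I understand the docs, fsync=off is the riskier of the two because a crash can leave the database corrupted rather than just missing the most recent commits.

```python
def render_overrides(settings):
    """Format a dict of settings as postgresql.conf lines (hypothetical helper)."""
    lines = []
    for name, value in settings.items():
        # postgresql.conf spells booleans as on/off.
        if isinstance(value, bool):
            value = "on" if value else "off"
        lines.append(f"{name} = {value}")
    return "\n".join(lines) + "\n"


# Illustrative values only, not the contents of the actual devstack change.
gate_overrides = {
    "max_connections": 200,       # assumed: double a stock default of 100
    "synchronous_commit": False,  # faster commits, may lose recent ones on crash
}

conf_text = render_overrides(gate_overrides)
print(conf_text)
```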

Anyway, bumping max_connections might fix the gate. I'm just sending this out to see if there are any postgresql experts out there with additional tips or insights on things we can tweak or look for, including whether it might be worthwhile to set synchronous_commit=off or fsync=off for gate runs.

[1] https://bugs.launchpad.net/nova/+bug/1338841
[2] http://goo.gl/yRBDjQ
[3] https://review.openstack.org/#/c/105854/

--

Thanks,

Matt Riedemann


_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
