On 09/15/2016 05:52 AM, Roman Podoliaka wrote:
> Mike,
>
> On Thu, Sep 15, 2016 at 5:48 AM, Mike Bayer <mba...@redhat.com> wrote:
>
>> * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur?
>> If so, was it also accompanied by the same "resource closed" condition or
>> did this second part of the condition only appear at 4.13.3?
>> * Did we see a similar "timeout" / "resource closed" combination prior to
>> 4.13.3, just with less frequency?
>
> I believe we did -
> https://bugs.launchpad.net/openstack-ci/+bug/1216851 , although we
> used mysql-python back then, so the error was slightly different.
>
>> * What is the magnitude of the "timeout" this fixture is using, is it on the
>> order of seconds, minutes, hours?
>
> It's set in seconds per project in .testr.conf, e.g.:
>
> https://github.com/openstack/nova/blob/master/.testr.conf
> https://github.com/openstack/ironic/blob/master/.testr.conf
>
> In Nova we also have a 'timeout scaling factor' specifically set for
> migration tests:
>
> https://github.com/openstack/nova/blob/master/nova/tests/unit/db/test_migrations.py#L67
>
>> * If many minutes or hours, can the test suite be observed to be stuck on
>> this test? Has someone tried to run a "SHOW PROCESSLIST" while this
>> condition is occurring to see what SQL is pausing?
>
> We could try to do that in the gate, but I don't expect to see
> anything interesting: IMO, we'd see regular queries that should have
> been executed fast, but actually took much longer time (presumably due
> to heavy disk IO caused by multiple workers running similar tests in
> parallel).
>
>> * Is this failure only present within the Nova test suite or has it been
>> observed in the test suites of other projects?
>
> According to
>
> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>
> it's mostly Nova, but this has also been observed in Ironic, Manila
> and Ceilometer. Ironic and Manila have OS_TEST_TIMEOUT value set to 60
> seconds.
>
>> * Is this failure present only on the "database migration" test suite or is
>> it present in other opportunistic tests, for Nova and others?
>
> Based on the console logs I've checked only migration tests failed,
> but that's probably due to the fact that they are usually the slowest
> ones (again, presumably due to heavy disk IO).
>
>> * Have there been new database migrations added to Nova which are being
>> exercised here and may be involved?
>
> Looks like there were no changes recently:
>
> https://review.openstack.org/#/q/project:openstack/nova+status:merged+branch:master+(file:%22%255Enova/db/sqlalchemy/migrate_repo/.*%2524%22+OR+file:%22%255Enova/tests/unit/db/test_migrations.py%2524%22)
>
>> I'm not sure how much of an inconvenience it is to downgrade oslo.db. If
>> downgrading it is feasible, that would at least be a way to eliminate it as
>> a possibility if these same failures continue to occur, or a way to confirm
>> its involvement if they disappear. But if downgrading is disruptive then
>> there are other things to look at in order to have a better chance at
>> predicting its involvement.
>
> I don't think we need to block oslo.db 4.13.3, unless we clearly see
> it's this version that causes these failures.
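[For context: the per-test timeout discussed above is roughly wired up as in
the sketch below. This is only a simplified illustration; the class name and
the scaling value here are made up, and the real implementation lives in
oslotest and the projects' own base test classes.]

    # Simplified sketch of the OS_TEST_TIMEOUT / TIMEOUT_SCALING_FACTOR
    # mechanism; the class name and the factor value are illustrative,
    # not Nova's exact code.
    import os

    import fixtures
    import testtools


    class MigrationTestCase(testtools.TestCase):

        # Hypothetical factor; Nova sets its own value for migration tests.
        TIMEOUT_SCALING_FACTOR = 2

        def setUp(self):
            super(MigrationTestCase, self).setUp()
            # OS_TEST_TIMEOUT is exported by the test runner based on
            # .testr.conf and is given in seconds.
            try:
                timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
            except ValueError:
                timeout = 0
            if timeout > 0:
                # Fail the test with a timeout error once the scaled
                # budget is exceeded.
                self.useFixture(fixtures.Timeout(
                    timeout * self.TIMEOUT_SCALING_FACTOR, gentle=True))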
> I gave version 4.11 (before the changes to provisioning) a try on my local
> machine and saw the very same errors when the concurrency level is high (
> http://paste.openstack.org/show/577350/ ), so I don't think the latest
> oslo.db release has anything to do with the increase in the number of
> failures on CI.
>
> My current understanding is that the load on gate nodes somehow
> increased (either we run more testr workers in parallel now, or
> apply/test more migrations, or simply run more VMs per host, or the gate
> is just busy at this point of the release cycle), so we have started
> to see these timeouts more often.
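[On the "SHOW PROCESSLIST" idea from the quoted thread: a quick way to
capture what the server is doing while a worker looks stuck could be
something like the sketch below. The connection URL assumes the usual
openstack_citest opportunistic-test credentials on a local MySQL and would
need adjusting to whatever the gate or a local node actually uses.]

    # Sketch: dump MySQL's process list while migration tests appear to
    # be hung, to see which statements are taking so long. The URL and
    # credentials below are assumptions, not gate-verified values.
    import sqlalchemy

    engine = sqlalchemy.create_engine(
        'mysql+pymysql://openstack_citest:openstack_citest@localhost/mysql')

    with engine.connect() as conn:
        for row in conn.execute(sqlalchemy.text('SHOW FULL PROCESSLIST')):
            print(row)

[As noted in the quoted reply, though, the expectation is that this would
mostly show ordinary queries that are simply slow under heavy disk IO.]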
The migration count is definitely going to grow over time, as is the
nature of the beast. Nova hasn't had a migration collapse in quite a
while. The higher patch volume in Nova and larger number of db
migrations could definitely account for Nova being higher here.

Is there a better timeout value that you think will make these timeouts
happen less often?

	-Sean

-- 
Sean Dague
http://dague.net