Hmm, looks like we now run more testr workers in parallel (8 instead of 4):
http://logs.openstack.org/76/335676/7/check/gate-nova-python34-db/6841fce/console.html.gz
http://logs.openstack.org/62/369862/3/check/gate-nova-python27-db-ubuntu-xenial/2784de9/console.html

On my machine, running the Nova migration tests against MySQL is much
slower with 8 workers than with 4 because of disk IO (it's an HDD). When
the tests time out (after 320s) I see the very same TimeoutException and
IndexError (something probably mangles the TimeoutException further up
the stack).

On Wed, Sep 14, 2016 at 6:44 PM, Mike Bayer <mba...@redhat.com> wrote:
>
> On 09/14/2016 11:08 AM, Mike Bayer wrote:
>>
>> On 09/14/2016 09:15 AM, Sean Dague wrote:
>>>
>>> I noticed the following issues happening quite often now in the
>>> opportunistic db tests for nova -
>>>
>>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>>>
>>> It looks like some race has been introduced where the various db
>>> connections are not fully isolated from each other like they used to
>>> be. The testing magic for this is buried pretty deep in oslo.db.
>>
>> That error message occurs when a connection that is expected to be
>> running a SELECT statement fails to provide a cursor.description
>> attribute. It is typically a driver-level bug in the MySQL world and
>> corresponds to mis-handled failure modes on the MySQL connection.
>>
>> By "various DB connections are not fully isolated from each other" are
>> you suggesting that a single in-Python connection object itself is
>> being shared among multiple greenlets? I'm not aware of a change in
>> oslo.db that would be related to such an effect.
>
> So, I think by "fully isolated from each other" what you really mean is
> "operations upon a connection are not fully isolated from the subsequent
> use of that connection", since that's what I see in the logs. A
> connection is attempting to be used during teardown to drop tables, but
> it's in an essentially broken state from a PyMySQL perspective. That
> indicates something went wrong with this (pooled) connection in the
> preceding test that could not be detected or reverted once the
> connection was returned to the pool.
>
> From Roman's observation, a likely source of this corruption is a
> timeout that interrupts the PyMySQL connection mid-operation. In the
> preceding stack trace, PyMySQL encounters a raise while attempting to
> call "self._sock.recv_into(b)", and it seems like some combination of
> eventlet's response to signals and the fixtures.Timeout() fixture is the
> cause of the interruption. As an additional wart, something else is
> getting involved and turning it into an IndexError; I'm not sure what
> that part is yet, though I can imagine it might be SQLAlchemy
> mis-interpreting what it expects to be a PyMySQL exception class, since
> we normally look inside exception.args[0] to get the MySQL error code.
> With a blank exception like fixtures.TimeoutException, .args is the
> empty tuple.
>
> The PyMySQL connection is now in an invalid state and unable to perform
> a SELECT statement correctly, but the connection is not invalidated; it
> is instead returned to the connection pool in a broken state. So the
> subsequent teardown, if it uses this same connection (which is likely),
> fails because the connection was interrupted in the middle of its work
> and never given the chance to clean up.
> It seems like the use of the fixtures.Timeout() fixture here is not
> organized to work with a database operation in progress, especially
> with an eventlet-monkeypatched PyMySQL. Ideally, if something like a
> timeout due to a signal handler occurs, the entire connection pool
> should be disposed (quickest way: engine.dispose()), or at the very
> least (and much more targeted), the connection that's involved should
> be invalidated from the pool, e.g. connection.invalidate().
>
> The change to the environment here would be that this timeout is
> happening at all - the reason for that is not yet known. If oslo.db's
> version were involved in this error, I would guess that it is related
> to this timeout condition being triggered, and not to anything to do
> with the connection provisioning.
>
>>> Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
>>> think we need to strongly consider blocking it and revisiting these
>>> issues post-Newton.
>>>
>>> -Sean
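
Mike's two points above - the args[0] lookup that turns a blank
TimeoutException into an IndexError, and the need to invalidate or
dispose of an interrupted connection - are easier to see in code. Both
snippets below are hedged sketches only: extract_error_code(),
run_with_timeout() and the connection URL are made-up placeholders,
while fixtures.Timeout / fixtures.TimeoutException,
Connection.invalidate() and Engine.dispose() are the real APIs
mentioned in the thread.

    # Sketch 1: why the timeout surfaces as an IndexError. SQLAlchemy's
    # MySQL error handling reads the error code out of exception.args[0];
    # extract_error_code() is a stand-in for those dialect internals.
    import fixtures

    def extract_error_code(exc):
        # Fine for a real PyMySQL error such as OperationalError(2013, ...),
        # whose args tuple starts with the MySQL error code.
        return exc.args[0]

    try:
        raise fixtures.TimeoutException()
    except Exception as exc:
        # fixtures.TimeoutException carries no arguments, so exc.args is
        # the empty tuple and the lookup itself raises IndexError,
        # masking the original timeout.
        try:
            extract_error_code(exc)
        except IndexError:
            print("IndexError instead of a MySQL error code")

    # Sketch 2: the remediation suggested above - if a timeout interrupts
    # a driver call, don't let that connection go back into the pool.
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:pass@localhost/test")

    def run_with_timeout(work, seconds):
        conn = engine.connect()
        try:
            with fixtures.Timeout(seconds, gentle=True):
                work(conn)
        except fixtures.TimeoutException:
            # Targeted option: mark just this connection as broken so the
            # pool replaces it instead of handing it to teardown.
            conn.invalidate()
            # Heavier option: throw away every pooled connection.
            engine.dispose()
            raise
        finally:
            conn.close()

Whichever option is used, the point is the same as in the thread: a
connection interrupted in the middle of a recv must not be reused to
drop tables during teardown.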