Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly

Roman Podoliaka Wed, 21 Sep 2016 09:18:07 -0700

FWIW, there was no new failures in Nova jobs since then.

I'm confused as well why these tests would sporadically take much
longer time to execute. Perhaps we could install something like atop
on our nodes to answer that question.


On Wed, Sep 21, 2016 at 5:46 PM, Ihar Hrachyshka <[email protected]> wrote:
> I just hit that TimeoutException error in neutron functional tests:
>
> http://logs.openstack.org/68/373868/4/check/gate-neutron-dsvm-functional-ubuntu-trusty/4de275e/testr_results.html.gz
>
> It’s a bit weird that we hit that 180 sec timeout because in good runs, the
> test takes ~5 secs.
>
> Do we have a remedy against that kind of failure? I saw nova bumped the
> timeout length for the tests. Is it the approach we should apply across the
> board for other projects?
>
> Ihar
>
>
> Zane Bitter <[email protected]> wrote:
>
>> On 14/09/16 11:44, Mike Bayer wrote:
>>>
>>> On 09/14/2016 11:08 AM, Mike Bayer wrote:
>>>>
>>>> On 09/14/2016 09:15 AM, Sean Dague wrote:
>>>>>
>>>>> I noticed the following issues happening quite often now in the
>>>>> opportunistic db tests for nova -
>>>>>
>>>>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> It looks like some race has been introduced where the various db
>>>>> connections are not fully isolated from each other like they used to
>>>>> be.
>>>>> The testing magic for this is buried pretty deep in oslo.db.
>>>>
>>>>
>>>> that error message occurs when a connection that is intended against a
>>>> SELECT statement fails to provide a cursor.description attribute.  It is
>>>> typically a driver-level bug in the MySQL world and corresponds to
>>>> mis-handled failure modes from the MySQL connection.
>>>>
>>>> By "various DB connections are not fully isolated from each other" are
>>>> you suggesting that a single in-Python connection object itself is being
>>>> shared among multiple greenlets?   I'm not aware of a change in oslo.db
>>>> that would be a relationship to such an effect.
>>>
>>>
>>> So, I think by "fully isolated from each other" what you really mean is
>>> "operations upon a connection are not fully isolated from the subsequent
>>> use of that connection", since that's what I see in the logs.  A
>>> connection is attempting to be used during teardown to drop tables,
>>> however it's in this essentially broken state from a PyMySQL
>>> perspective, which would indicate something has gone wrong with this
>>> (pooled) connection in the preceding test that could not be detected or
>>> reverted once the connection was returned to the pool.
>>>
>>> From Roman's observation, it looks like a likely source of this
>>> corruption is a timeout that is interrupting the state of the PyMySQL
>>> connection.   In the preceding stack trace, PyMySQL is encountering a
>>> raise as it attempts to call "self._sock.recv_into(b)", and it seems
>>> like some combination of eventlet's response to signals and the
>>> fixtures.Timeout() fixture is the cause of this interruption.   As an
>>> additional wart, something else is getting involved and turning it into
>>> an IndexError, I'm not sure what that part is yet though I can imagine
>>> that might be SQLAlchemy mis-interpreting what it expects to be a
>>> PyMySQL exception class, since we normally look inside of
>>> exception.args[0] to get the MySQL error code.   With a blank exception
>>> like fixtures.TimeoutException, .args is the empty tuple.
>>>
>>> The PyMySQL connection is now in an invalid state and unable to perform
>>> a SELECT statement correctly, but the connection is not invalidated and
>>> is instead returned to the connection pool in a broken state.  So the
>>> subsequent teardown, if it uses this same connection (which is likely),
>>> fails because the connection has been interrupted in the middle of its
>>> work and not given the chance to clean up.
>>>
>>> Seems like the use of fixtures.Timeout() fixture here is not organized
>>> to work with a database operation in progress, especially an
>>> eventlet-monkeypatched PyMySQL.   Ideally, if something like a timeout
>>> due to a signal handler occurs, the entire connection pool should be
>>> disposed (quickest way, engine.dispose()), or at the very least (and
>>> much more targeted), the connection that's involved should be
>>> invalidated from the pool, e.g. connection.invalidate().
>>>
>>> The change to the environment here would be that this timeout is
>>> happening at all - the reason for that is not yet known.   If oslo.db's
>>> version were involved in this error, I would guess that it would be
>>> related to this timeout condition being caused, and not anything to do
>>> with the connection provisioning.
>>>
>>>>> Olso.db 4.13.3 did hit the scene about the time this showed up. So I
>>>>> think we need to strongly consider blocking it and revisiting these
>>>>> issues post newton.
>>
>>
>> We've been seeing similar errors in Heat since at least Liberty
>> (https://bugs.launchpad.net/heat/+bug/1499669). Mike and I did some poking
>> around yesterday and basically confirmed his theory above. If you look at
>> the PyMySQL code, it believes that only an IOError can occur while writing
>> to a socket, so it has no handling for any other type of exception, thus it
>> can't deal with signal handlers raising exceptions or other exceptions being
>> thrown into the greenthread by eventlet. It sounds like sqlalchemy also
>> fails to catch at least some of these exceptions and invalidate the
>> connection.
>>
>> tl;dr this appears to have been around forever (at least since we switched
>> to using a pure-Python MySQL client) and is almost certainly completely
>> unrelated to any particular release of oslo.db.
>>
>> cheers,
>> Zane.
>>
>>
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: [email protected]?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
>
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: [email protected]?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly

Reply via email to