hi all -

I've been researching this Cinder issue, 
https://bugs.launchpad.net/cinder/+bug/1417018, and I've found something 
concerning.

Basically, there's a race condition occurring because a single MySQL 
connection is being shared between two green threads.  It happens while 
Cinder is doing its startup work in cinder/volume/manager.py -> init_host(), 
while at the same time a request comes in from a separate service call that 
also seems to be part of startup.
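
To make the failure mode concrete, here is a minimal standalone sketch of 
the situation (not Cinder code - the connection parameters and the explicit 
sleep are mine, and the yield is simulated rather than organic):

import eventlet
eventlet.monkey_patch()

import MySQLdb

# one raw DBAPI connection, deliberately shared between two green threads
conn = MySQLdb.connect(host="127.0.0.1", user="scott", passwd="tiger", db="test")


def big_query():
    # stands in for the volume_get_all_by_host() SELECT
    cur = conn.cursor()
    cur.execute("SELECT table_name FROM information_schema.tables")
    # simulate the green thread yielding before the result is fully consumed
    eventlet.sleep(0)
    return cur.fetchall()


def ping():
    # stands in for the "SELECT 1" liveness check on another green thread
    cur = conn.cursor()
    cur.execute("SELECT 1")
    return cur.fetchall()


gt1 = eventlet.spawn(big_query)
gt2 = eventlet.spawn(ping)
gt1.wait()
gt2.wait()

Whether this produces the exact symptom below depends on the driver and 
timing, but the two cursors are now racing on one connection, which is the 
condition the log shows.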

The log output at http://paste.openstack.org/show/175571/ shows this happening. 
 I can break it down:


1. A big query for volumes occurs as part of 
"self.db.volume_get_all_by_host(ctxt, self.host)".  To reproduce the error 
more regularly I've placed it into a loop of 100 calls.  We can see that the 
thread id is 68089648 and the MySQL connection is 3a9c5a0 (the tid/connection 
values come from diagnostic logging; a sketch of that kind of hook follows 
the breakdown).

2015-02-16 19:32:47.236 INFO sqlalchemy.engine.base.Engine 
[req-ed3c0248-6ee5-4063-80b5-77c5c9a23c81 None None] tid: 68089648, connection: 
<_mysql.connection open to '127.0.0.1' at 3a9c5a0>, stmt SELECT 
volumes.created_at AS 
2015-02-16 19:32:47.237 INFO sqlalchemy.engine.base.Engine 
[req-ed3c0248-6ee5-4063-80b5-77c5c9a23c81 None None] 
('localhost.localdomain@ceph', 'localhost.localdomain@ceph#%')

2. A "ping" query comes in, related to a different API call - different 
thread id, *same* connection.

2015-02-16 19:32:47.276 INFO sqlalchemy.engine.base.Engine 
[req-600ef638-cb45-4a34-a3ab-6d22d83cfd00 None None] tid: 68081456, connection: 
<_mysql.connection open to '127.0.0.1' at 3a9c5a0>, stmt SELECT 1
2015-02-16 19:32:47.279 INFO sqlalchemy.engine.base.Engine 
[req-600ef638-cb45-4a34-a3ab-6d22d83cfd00 None None] ()

3. The first statement is still in the middle of being invoked, so we get a 
failure: either a mismatch of the statement to the cursor, or a MySQL "lost 
connection" error (stack trace begins)

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in 
fire_timers
 … more stack trace

4. Another thread id, *same* connection; the stack trace begun in step 3 then 
finishes with NoSuchColumnError for the volumes query.

2015-02-16 19:32:47.290 INFO sqlalchemy.engine.base.Engine 
[req-f980de7c-151d-4fed-b45e-d12b133859a6 None None] tid: 61238160, connection: 
<_mysql.connection open to '127.0.0.1' at 3a9c5a0>, stmt SELECT 1

    rows = [process[0](row, None) for row in fetch]
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/loading.py", line 
363, in _instance
    tuple([row[column] for column in pk_cols])
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/result.py", line 
331, in _key_fallback
    expression._string_or_unprintable(key))
NoSuchColumnError: "Could not locate column in row for column 'volumes.id'"
2015-02-16 19:32:47.293 ERROR cinder.openstack.common.threadgroup [-] "Could 
not locate column in row for column 'volumes.id'"
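
As an aside, the "tid: ..., connection: ..." prefix in that paste is not 
stock SQLAlchemy engine logging; it comes from diagnostic logging added for 
this investigation.  A hook roughly like the following produces that kind of 
output (a sketch, not the exact patch):

import thread  # Python 2, matching the traceback above

from sqlalchemy import event
from sqlalchemy.engine import Engine


@event.listens_for(Engine, "before_cursor_execute")
def log_connection_identity(conn, cursor, statement, parameters, context,
                            executemany):
    # conn.connection is the pooled connection proxy; its .connection
    # attribute is the raw _mysql.connection object seen in the log
    conn.engine.logger.info(
        "tid: %s, connection: %r, stmt %s",
        thread.get_ident(), conn.connection.connection, statement)

Note that under eventlet's monkey patching, thread.get_ident() identifies the 
current greenlet, so different tids against the same raw connection really do 
mean two green threads sharing that connection.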


So I'm not really sure what's going on here.  Cinder seems to have some 
OpenStack greenlet code of its own in cinder/openstack/common/threadgroup.py, 
and I don't know the purpose of that code.  SQLAlchemy's connection pool has 
been tested extensively with eventlet / gevent, and this has never been 
reported before.  This is a very severe race, and I'd expect it to be 
happening all over the place.

Current status is that I’m continuing to try to determine why this is happening 
here, and seemingly nowhere else.
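
A check along these lines might at least catch the sharing in the act: 
record which greenlet checks each raw DBAPI connection out of the pool, and 
complain if a different greenlet then executes on it.  This is a rough, 
untested sketch, not a proposed fix:

from eventlet import greenthread

from sqlalchemy import event
from sqlalchemy.engine import Engine
from sqlalchemy.pool import Pool

# raw DBAPI connection -> greenlet that checked it out of the pool
_owner = {}


@event.listens_for(Pool, "checkout")
def _remember_owner(dbapi_conn, connection_record, connection_proxy):
    _owner[dbapi_conn] = greenthread.getcurrent()


@event.listens_for(Pool, "checkin")
def _forget_owner(dbapi_conn, connection_record):
    _owner.pop(dbapi_conn, None)


@event.listens_for(Engine, "before_cursor_execute")
def _assert_not_shared(conn, cursor, statement, parameters, context,
                       executemany):
    raw = conn.connection.connection
    owner = _owner.get(raw)
    if owner is not None and owner is not greenthread.getcurrent():
        raise AssertionError(
            "connection %r checked out by greenlet %r but used from %r"
            % (raw, owner, greenthread.getcurrent()))

If that assertion fires inside init_host(), it should point at whatever is 
handing the same Session or Connection to two green threads.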



