I'd like to add some context around Vish's point 1)

> 1) Switch to using scoped sessions in sqlalchemy (the naive version of this 
> apparently breaks migrations)

We've been fighting a problem in our Diablo.Final+ based deployment that seems 
to be a lot like:

https://bugs.launchpad.net/nova/+bug/855660

inasmuch as it's a random but repeatable error when running 
euca-describe-instances, the difference being that it's a lazy load of 
'network' that fails rather than 'fixed_ips'.  However, the Diablo.Final 
version of nova/db/sqlalchemy/api:instance_get_all_by_filters still includes 
all of the joinedloads that were cited in the bug report as being the 
reason.

The fix to #855660 doesn't directly address this; instead it completes the 
refactoring of the EC2 API to use the network API, matching a similar change 
in OSAPI (#854585), which changes the whole semantics of how the network 
data is retrieved.

We decided against simply back-porting those fixes as it looked pretty 
complicated and we were worried that there was a more general issue lurking in 
here around lazy evaluation that would bite later in a different query.

Our knowledge of the codebase hasn't let us drive this down to a root cause 
yet, but here's what we have observed so far:
- The error is generated from an attempt to lazy load an object after the DB 
session has finished, triggered by an access to the network object from the 
EC2 api code shortly after the DB api call has returned.  My understanding 
(mainly from Vish) is that Nova aims to avoid any need for lazy loading by 
always fully loading the object in the DB api layer, and the code looks like 
it would do this.  Hence either something is sometimes removing this object 
after the db api call has completed, or the object isn't properly loaded in 
the first place.  Most of the time, however, it works.

 - It seems to be quite elusive to reproduce.  We've had systems where it 
was a pretty solid failure (always failing after 10-15 calls to 
euca-describe-instances).  However, adding any debugging seems to clear the 
problem, and in many cases the problem can't be reproduced after the 
additional debugging is removed.  This is on static, dedicated systems, where 
there are no new instances being created and no other calls being made to 
the API.

 - Although it feels like a timing issue of some sort, it doesn't seem to be 
directly thread related.  On systems that are showing the problem a series of 
sequential calls to euca-describe-instances will trigger it.

- Thinking that it might be caused by accessing the network object too long 
after the session finishes, I've tried adding a 20 second delay in the ec2 api 
code between the db api call and the access to the network object, but that 
works fine.

- On a system where the problem was solidly reproducible we found that either 
of the following changes seemed to fix the problem (in that adding the change 
made it always work, and backing out the change made it fail again):

        1) Changing the session initialization in db/sqlalchemy/session.py to 
scoped sessions


        2) Changing the FixedIP model in db/sqlalchemy/models.py from:
                
                network = relationship(Network, backref=backref('fixed_ips'))
        to:
                network = relationship(Network, backref=backref('fixed_ips'),
                                       lazy="joined")


We started looking at 1) when we thought the issue might be due to thread 
contention, but given that we see the issue even without any threading going on 
it's hard to see why this seems to fix the issue.  It does, however, break the 
unit tests in a fairly big way around migration (although a real migration 
works fine).

Currently we're working with 2) - which also seems to be a solid fix.

So the gut feel at this stage is that there is some small timing window between 
the db api call and the read of the network object which triggers a lazy load 
(either because for some reason it hasn't been loaded yet, or because it was 
loaded and went away), which is invalid because the session ends when (or 
sometime after) the db api call completes.
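
To make the suspected failure mode concrete, here is a minimal sketch (not 
Nova code) of a lazy load attempted after the session has closed, and of how 
lazy="joined" avoids it.  It assumes SQLAlchemy 1.4 or later and an in-memory 
SQLite database; the cut-down Network and FixedIp models and the helper 
network_label_after_close are hypothetical stand-ins for the real ones in 
db/sqlalchemy/models.py.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import backref, declarative_base, relationship, sessionmaker
from sqlalchemy.orm.exc import DetachedInstanceError


def network_label_after_close(lazy):
    """Load a FixedIp, close the session, then read fixed_ip.network.

    `lazy` is passed straight to relationship(): "select" is SQLAlchemy's
    default (load on first access), "joined" loads it with the parent row.
    """
    Base = declarative_base()  # fresh registry so the models can be rebuilt

    class Network(Base):  # hypothetical, heavily simplified model
        __tablename__ = "networks"
        id = Column(Integer, primary_key=True)
        label = Column(String(255))

    class FixedIp(Base):  # hypothetical, heavily simplified model
        __tablename__ = "fixed_ips"
        id = Column(Integer, primary_key=True)
        address = Column(String(255))
        network_id = Column(Integer, ForeignKey("networks.id"))
        network = relationship(Network, backref=backref("fixed_ips"), lazy=lazy)

    engine = create_engine("sqlite://")  # in-memory database
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)

    # Store one fixed IP together with its network.
    session = Session()
    session.add(FixedIp(address="10.0.0.2", network=Network(label="private")))
    session.commit()
    session.close()

    # Mimic a db api call: query, then close the session before returning.
    session = Session()
    fixed_ip = session.query(FixedIp).first()
    session.close()  # fixed_ip is now detached from any session

    # Mimic the EC2 api code touching the network object afterwards.
    try:
        return fixed_ip.network.label
    except DetachedInstanceError:
        return "DetachedInstanceError"
```

With lazy="select" the access after close needs a lazy load and the helper 
reports DetachedInstanceError; with lazy="joined" the network was already 
loaded alongside the fixed IP and its label is returned.  This only shows the 
mechanism; it does not explain why the failure we see is intermittent.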

I know that this issue will go away with the DB refactoring in Essex, but it 
would be really good in the meantime to know what was behind the failures we've 
seen - so if anyone in this group has ideas on what might be causing it, what 
has caused similar issues in the past, comments on the above, etc we'd be 
really glad to hear them.

Cheers,
Phil 
 




-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf 
Of Monsyne Dragon
Sent: 02 November 2011 21:21
To: Vishvananda Ishaya
Cc: <[email protected]>
Subject: Re: [Nova-database] Organizing/Meeting

Yah, a meeting would be good.  I think there are some things that can be 
improved here, staunching some issues with the db layer that I think 'leak' 
into the design of the rest of Nova...

Some random thoughts to ponder on the below items: 


On Nov 2, 2011, at 3:22 PM, Vishvananda Ishaya wrote:

> Hey Guys,
> 
> It would probably be good to schedule an irc meeting to get the ball rolling 
> on some db changes.  Based on some discussions that I've had recently, I see 
> the following potential action items.  Not all of these have been turned 
> into blueprints yet:
> 
> 1) Switch to using scoped sessions in sqlalchemy (the naive version of this 
> apparently breaks migrations)

+1 

> 2) Try using the pure python mysql driver so eventlet can monkeypatch the 
> calls (this probably requires 1)

+1

> 3) Sanitize all objects to dictionaries coming out of the db layer (the naive 
> version of this is just wrapping all of the return values in dict())

Do we *really* want to do this?  I would suggest going the other way.  
SQLAlchemy is a data mapper ORM; if we split the mapping from the models, the 
models can be Plain Old Python Objects, independent of the persistence layer, 
that any db driver layer could accept and return.  That way, we could have 
*real* model objects with *methods*, which would go a long way towards helping 
apply #7 below, amongst other things. 

> 4) Remove unused / stale db calls

Also good. 

> 5) Break db.api into multiple files

Yes, the db.api is getting to be unwieldy.  Random idea:  Make the APIs 
classes.  There is a lot of duplication of basic xxxx_get(), xxxx_get_all(), 
etc. methods.  Inheritance could help. 
Second random thought:  the db methods could be instance methods of those 
classes.  Since the context object is needed for all db calls anyway, 
instances of the db-api classes could be fetched from a method or property 
on the context object itself, with the context passing itself into the 
constructor of the api object. 

i.e.    db.instance_get(context, value)   becomes some variant of: 

context.instance_api.get(value)  (or context.db_api('instance').get()  or 
context.db.instance.get()  .... you get the idea)  
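
A rough sketch of the context.db.instance.get() variant, in plain Python with 
a dict standing in for the database.  Every name here (BaseApi, InstanceApi, 
NetworkApi, DbNamespace, RequestContext, store) is made up for illustration 
and is not existing Nova code.

```python
class BaseApi:
    """Generic get()/get_all() shared by every resource API (hypothetical)."""

    table = None  # subclasses name the table they serve

    def __init__(self, context):
        # The context passes itself in, so callers never hand it over again.
        self.context = context

    def _rows(self):
        return self.context.store[self.table]

    def get(self, value):
        return self._rows()[value]

    def get_all(self):
        return list(self._rows().values())


class InstanceApi(BaseApi):
    table = "instances"


class NetworkApi(BaseApi):
    table = "networks"


class DbNamespace:
    """Groups the per-resource APIs behind context.db (hypothetical)."""

    def __init__(self, context):
        self.instance = InstanceApi(context)
        self.network = NetworkApi(context)


class RequestContext:
    def __init__(self, store):
        self.store = store  # a dict of dicts standing in for a real database

    @property
    def db(self):
        return DbNamespace(self)


ctx = RequestContext({
    "instances": {1: {"id": 1, "hostname": "vm-1"}},
    "networks": {},
})
instance = ctx.db.instance.get(1)  # replaces db.instance_get(context, 1)
```

The shared get()/get_all() live once in BaseApi, and each resource class only 
names its table, which is the duplication that inheritance would remove.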

> 6) Test the db layer directly (This will give us a good record of expected 
> objects returned from the db layer)

+1e9  Really. 

> ---
> 7) Use the law of demeter for db objects instead of indirectly accessing 
> subobjects (This implies a heavy performance penalty, so we will probably 
> need smart caching where we joinedload objects when possible and return the 
> cached object instead of reloading)
> 8) Implement a second db driver (zookeeper)
> 9) Split the dbs for different components into separate databases (this is a 
> heavy change and will require code changes throughout the code)
> 
> 7-9 are definitely longer term goals, and they probably won't make it into 
> the essex timeframe.  I think 1 through 6 are all doable in this release, and 
> we may be able to make some progress on the others as well.
> 
> Vish
> 
> 
> -- 
> Mailing list: https://launchpad.net/~nova-database
> Post to     : [email protected]
> Unsubscribe : https://launchpad.net/~nova-database
> More help   : https://help.launchpad.net/ListHelp

--
        Monsyne M. Dragon
        OpenStack/Nova 
        cell 210-441-0965
        work x 5014190


-- 
Mailing list: https://launchpad.net/~nova-database
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~nova-database
More help   : https://help.launchpad.net/ListHelp

