> -----Original Message-----
> From: sqlalchemy@googlegroups.com [mailto:sqlalchemy@googlegroups.com]
> On Behalf Of Aviv Giladi
> Sent: 20 April 2011 15:53
> To: sqlalchemy
> Subject: [sqlalchemy] Best design for commits?
> 
> Hey guys,
> 
> I have a Pylons back-end running on SQLAlchemy. I have a script that
> reads a tree of XML files from an HTTP server (it downloads an XML
> file X, then downloads X's children, then iterates over those
> children, and so on recursively). Each XML file represents an
> SQLAlchemy model.
> 
> The problem is that I have thousands of these XML files (sometimes
> 5,000, sometimes 26,000). I was able to optimize the download process
> with HTTP pooling, but I cannot settle on the best approach for
> committing the models to the DB. Every time an XML file is
> downloaded, I create an ORM object for it and add it to my session.
> 
> Problem 1: some XML files will exist multiple times in the tree, so I
> am checking that there is no duplicate insertion. Is the check in my
> code optimal, or should I keep an indexed collection on the side and
> use it to check for duplicates?
> 
> Problem 2: my autocommit is set to False because I don't want to
> commit on every add (not because it's bad design, but because of
> performance). But I also don't want to iterate the entire tree of
> thousands of categories without committing at all. Therefore, I
> created a constant number of adds after which my code commits the
> data. Is this a good approach? What would be a good number for that?
> It might be important to mention that I do not know in advance how
> many XML files I am looking at.
> 
> Here is what my pseudo-code looks like now:
> 
>     count = 0
>     COMMIT_EVERY = 50
> 
>     def recursion(parent_id):
>         global count
>         pool = get_http_connection_pool(...)
>         sub_xmls = get_sub_xmls(pool, parent_id)
> 
>         if sub_xmls is None:
>             return
> 
>         for sub_xml in sub_xmls:
>             orm_obj = MyObj(sub_xml)
> 
>             # skip XML files already inserted earlier in the walk
>             duplicate = Session.query(MyObj).filter(MyObj.id == orm_obj.id).first()
>             if not duplicate:
>                 Session.add(orm_obj)
>                 count += 1
>                 if count % COMMIT_EVERY == 0:
>                     Session.commit()
>                 recursion(orm_obj.id)
> 
>     recursion(0)
> 

I'm not sure I can comment on the overall approach, but there are a
few things that might help you.

1. If you use Query.get rather than Query.filter, you won't actually
query the database when the object already exists in the session.
You'll probably need to clear the session every now and then (I don't
think flush() or commit() clear it, but I could be wrong).
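
For example, something like this (a sketch based on your pseudo-code;
I'm assuming MyObj.id is the primary key, since get() looks objects up
by primary key):

    # get() consults the session's identity map first, so it emits no
    # SQL when the object is already loaded in this session
    duplicate = Session.query(MyObj).get(orm_obj.id)
    if duplicate is None:
        Session.add(orm_obj)

To clear the session when memory becomes a concern, Session.expunge_all()
empties it - just make sure you flush or commit first, or any pending
objects will never be written.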

2. You may want to distinguish Session.flush() from Session.commit() -
you could flush every N new objects, and only commit once at the very
end. 
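Adapted to your loop, that might look something like this (FLUSH_EVERY
is just an illustrative name):

    count += 1
    if count % FLUSH_EVERY == 0:
        # flush() sends the pending INSERTs to the database but keeps
        # the transaction open, so everything can still be rolled back
        Session.flush()

and then, once the whole tree has been walked:

    Session.commit()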

3. If you know you are the only person writing to the database, consider
setting expire_on_commit=False on your session. Otherwise I think
accessing orm_obj.id after Session.commit() will trigger another
(possibly unnecessary) query to the database.
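
That's a one-line change where the session is configured - something
like this (a sketch; in Pylons the session is usually a scoped_session
set up in your model package, and the engine URL here is just a
placeholder):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import scoped_session, sessionmaker

    engine = create_engine("sqlite://")  # placeholder URL
    Session = scoped_session(sessionmaker(bind=engine,
                                          expire_on_commit=False))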

Hope that helps,

Simon
