Hi all,
in the previous days, Giuseppe and me have worked on the sql data
access system, to make it db-independent, using SQLObject.

Unfortunately (after a lot of work was already done, sigh!) we are
now facing a problem: performances (at insert-time only [1]).

The fact is: we can completely rely on SQLObject to manage data
insertion, but it doesn't use the executemany() method of the
"cursor" object of the underling database's back end, making
it _really slow_ for every database server (mysql, postgresql,
sqlite, ...)
This is a pity because, at least for some database engines (MySQL,
for instance), the use of the executemany() method can speed up
the imdbpy2sql script by a factor of 4 or 5.

Another option is to extract the "cursor" object by the
SQLObject's connection, and directly use its executemany() method.
This would take advantage of db back ends with a fast
executemany method, while other database will still be slow (this
is inevitable, so we don't care).
At this point another problem arises: the style of the parameters
passed to the executemany() is somewhat free, and different modules
use different conventions. :-/


I fear that now we have only three options:
1. ignore performances, use only SQLObject and ignore the fact that
   MySQL (and maybe others DBs) can run in just 2/3 hours [2] instead
   of 8/12 hours.
2. go on using the cursor.executemany() method as described above,
   trying to (more or less manually) support every different
   parameters style.
3. drop SQLObject, and find a different tool.
   I've looked at SQLAlchemy, and it has a different, more layered,
   architecture: it can be used to abstract just the connection and
   the cursor, guaranteeing that the underling executemany() method
   is called.
   At an higher level it can mimic almost everything SQLObject do
   (I think), but _if_ we choose to switch to SQLAlchemy, I'll
   probably prefer to stay at the lower possible level.
   Except the facts that we're using SQLAlchemy and that we don't
   need to handle executemany's parameters by ourself, options 2
   and 3 are architecturally very similar (both are low-level, using
   cursor.executemany)


To summarize, I'm asking your opinion (especially to Giuseppe,
having he already written a lot of code).
To _me_ the first one sounds really bad; if you have had asked
me yesterday, I would have chosen the 2nd, but it seems prone
to errors; today I've given a first glance to SQLAlchemy at
it _seems_ suitable for the work [3], but I can be wrong: I never
used it.

Either case, some (or a lot of) work will be lost and the release
of 2.5 will be delayed.  That was a fault of me and I'm really
sorry for this: either I've chosen the wrong too or I've not
fully understood it before starting the new implementation.
Given my knowledge of SQLAlchemy - and databases and ORMs in
general - this situation can repeat itself, so I'm asking
your opinions [4].


+++
[1] we have to insert ~30mln of row.  And the db continuously grows...
[2] just an estimate, it can be longer.
[3] it uses a single parameter style, hiding the underling
    implementation.  I don't know a thing about performances.
[4] speak freely: I'll reserve to myself the right to kindly ignore
    your ideas. ;-)

-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/


-------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel

Reply via email to