I wrote a simple benchmark to compare different approaches to
inserting data; the scripts and the outputs are in the attachment.

What the test does: it creates the CastInfo table (and the required
RoleType table) in the database, and then inserts 100 lists, each
containing 20,000 items of randomly generated information
(movieID, personID, role text, note text, roleID, ...) by calling
the executemany method of the cursor object of the underlying Python
module.
This emulates roughly 20% of the whole data set we are working on.
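
As a rough sketch of what the insert loop looks like (this is not the
actual script from the attachment: connection parameters, column names
and the random-data helper below are only illustrative), the core is a
plain DB-API executemany call:

  import random
  import MySQLdb

  # Connection parameters are placeholders for the real test setup.
  conn = MySQLdb.connect(db='imdb_test', user='test')
  cur = conn.cursor()

  INSERT_SQL = """INSERT INTO cast_info
      (movie_id, person_id, person_role, note, role_id)
      VALUES (%s, %s, %s, %s, %s)"""

  def random_block(size=20000):
      # Stand-in for the real random-data generator used by the test.
      return [(random.randint(1, 10**6), random.randint(1, 10**6),
               'some role text', 'some note text', random.randint(1, 11))
              for _ in range(size)]

  for block in range(100):
      # executemany() sends the whole 20,000-row block in a single call,
      # which is where MySQLdb gets its speed advantage.
      cur.executemany(INSERT_SQL, random_block())
      conn.commit()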

The tested cases (the scripts need some modifications to cover the
different scenarios):
1) SQLObject: it automatically generates the 'id' primary key, and
   there are also indexes on the personID and movieID columns.
   The indexes are updated at insert time, because SQLObject does not
   support creating an index _afterwards_.
2) SQLAlchemy 0.2: the same setup as the first test; the only
   difference is that it does not automatically create an 'id' primary key.
3) SQLAlchemy with indexes created _after_ the data insertion.
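
For case 3, the relevant point is that the table is declared without
any indexes, the data is bulk-inserted, and only then are the Index
objects created.  A minimal sketch (using a current SQLAlchemy API
rather than the exact 0.2 calls of the test; the engine URL, the column
types and the blocks() generator are assumptions):

  from sqlalchemy import (MetaData, Table, Column, Integer, String,
                          Index, create_engine)

  engine = create_engine('mysql://test@localhost/imdb_test')  # placeholder URL
  metadata = MetaData()

  # The table starts out "bare": no indexes are declared here.
  cast_info = Table('cast_info', metadata,
                    Column('movie_id', Integer),
                    Column('person_id', Integer),
                    Column('person_role', String(255)),
                    Column('note', String(255)),
                    Column('role_id', Integer))
  metadata.create_all(engine)

  with engine.begin() as conn:
      # blocks(): hypothetical generator yielding lists of 20,000 row dicts.
      for block in blocks():
          # A list of dicts is executed as a single executemany call.
          conn.execute(cast_info.insert(), block)

  # Only now are the indexes built, in one pass over the full table.
  with engine.begin() as conn:
      Index('cast_info_person_id', cast_info.c.person_id).create(conn)
      Index('cast_info_movie_id', cast_info.c.movie_id).create(conn)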

Results (tested with MySQL, because it seems to be the only database to
take real advantage of the executemany method):
case 1) initially every 20,000-row batch takes ~2 seconds; after 50 blocks
        each insert can take 10-20 seconds; near the end even >50 seconds.
        Total time: 16 minutes, 55 seconds.
case 2) approximately the same result as case 1, with slightly lower
        values, because it does not create/update the 'id' primary key.
        Total time: 14 minutes, 40 seconds.
case 3) every 20,000-row batch takes ~2 seconds, no matter how much
        data is already in the table.
        Afterwards, the creation of the indexes took about a minute.
        Total time to insert the data: 2 minutes, 22 seconds.
        Time to create the indexes: 1 minute, 10 seconds.
        Total time: 3 minutes, 32 seconds.

Conclusion: what we really need is to decouple the insertion of the data
from the generation of the indexes.
Notice that, when the indexes are updated on every insert, the insert
time keeps growing as the table fills: the test ran for only 100 cycles,
but the real data will require _500_ rounds!
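
At the plain SQL level, the decoupling simply means issuing the index
DDL after the bulk load; something like the following, through the same
DB-API cursor used in the first sketch (index and table names are
illustrative):

  # After all the rows are in, build the indexes with plain DDL.
  cur.execute("CREATE INDEX cast_info_person_id ON cast_info (person_id)")
  cur.execute("CREATE INDEX cast_info_movie_id ON cast_info (movie_id)")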

Now my question is: does SQLObject _really_ not support the creation of
indexes _after_ the data is in the table?
I tried some weeks ago, but maybe I missed something.

If it's not possible, I think switching to SQLAlchemy is inevitable.


-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/

Attachment: sql_tests.tgz
Description: GNU Unix tar archive
