I wrote a simple benchmark to compare different approaches to inserting the data; the scripts and the outputs are in the attachment.
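In short, every run boils down to something like the following (a rough sketch, not the attached scripts: the table and column names are invented, the real tests create the tables through SQLObject/SQLAlchemy, and this goes straight through MySQLdb):

  import random
  import MySQLdb

  conn = MySQLdb.connect(db='imdb_test', user='imdb')
  cur = conn.cursor()

  INSERT_SQL = ("INSERT INTO cast_info"
                " (person_id, movie_id, person_role, note, role_id)"
                " VALUES (%s, %s, %s, %s, %s)")

  # 100 blocks of 20,000 randomly generated rows each; the cast_info
  # table is assumed to already exist (created by SQLObject/SQLAlchemy).
  for block in range(100):
      rows = [(random.randint(1, 10**6), random.randint(1, 10**6),
               'some role', 'some note', random.randint(1, 11))
              for _ in range(20000)]
      cur.executemany(INSERT_SQL, rows)
      conn.commit()

  # Case 3 only: the indexes are created after all the data is in.
  cur.execute("CREATE INDEX cast_info_person_id ON cast_info (person_id)")
  cur.execute("CREATE INDEX cast_info_movie_id ON cast_info (movie_id)")
  conn.commit()
  cur.close()
  conn.close()

In cases 1 and 2 the indexes (and, with SQLObject, the 'id' primary key) already exist while the rows are being inserted, instead of being created at the end.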
What the test does: it creates the CastInfo table (and the required RoleType table) in the database, then inserts 100 blocks, each containing 20,000 rows of randomly generated data (movieID, personID, role text, note text, roleID, ...), calling the executemany method of the cursor object of the underlying Python module. It emulates roughly 20% of the whole data set we are working on.

The tested cases (switching between scenarios requires some modifications to the scripts):

1) SQLObject: it automatically generates the 'id' primary key, and there are also indexes on the personID and movieID columns. The indexes are updated at insert time, because SQLObject does not support creating an index _after_ the data is loaded.

2) SQLAlchemy 0.2: the same situation as the first test; the only difference is that it does not automatically create an 'id' primary key.

3) SQLAlchemy, with the indexes created _after_ the data insertion.

Results (tested with MySQL, because it seems to be the only database that takes real advantage of the executemany method):

case 1) initially every 20,000-row block takes ~2 seconds; after 50 blocks every insert can take 10-20 seconds; near the end even more than 50 seconds.
        Total time: 16 minutes, 55 seconds.

case 2) approximately the same result as case 1, with slightly lower values, since it does not create/update the 'id' primary key.
        Total time: 14 minutes, 40 seconds.

case 3) every 20,000-row block takes ~2 seconds, no matter how much data is already in the table; after that, creating the indexes took about a minute.
        Total time to insert the data: 2 minutes, 22 seconds.
        Time to create the indexes: 1 minute, 10 seconds.
        Total time: 3 minutes, 32 seconds.

Conclusion: what we really need is to decouple the insertion of the data from the creation of the indexes. Notice that, when the indexes are updated on every insert, the insert time grows dramatically as the table fills: the test ran for only 100 blocks, but the real data will require _500_ rounds!

Now my question is: does SQLObject _really_ not support creating indexes _after_ the data is already in the table? I tried some weeks ago, but maybe I missed something. If it's not possible, I think that switching to SQLAlchemy is inevitable.

-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/
sql_tests.tgz
Description: GNU Unix tar archive