On 9/1/2013 3:42 PM, Ben Companjen wrote:

> Hi all,
>
> I created a Python script that reads a dump file and puts the edition
> records in a MySQL database.
>
> It works (when you manually create the tables), but it's very slow:
> 10000 records in about an hour, which means all editions will take
> about 10 days of continuous operation.
>
> Does anybody have a faster way? Is there some script for this in the 
> repository?

Back in March 2010 I did a similar exercise, with some differences. My 
database schema was fairly complex (tables for authors, author names, 
works, expressions, some intersection tables, etc.), I used PostgreSQL, 
and Java with JDBC. My intent was not to actually have a useful 
database, but 1. to refresh some Java and SQL skills, and 2. to prove 
that OL data could be put into a real database (right now, OL uses a 
database only as a kind of substitute file system).

In my experiment I didn't turn off indexing before insertions, so I 
didn't get that optimization. Because of relational issues, I may have 
done two passes on the data dump, first parsing all of the author 
records before I parsed the edition records (I just don't recall). 
Running on an old Pentium III I had in the basement, it took about 4 
days to load all the OL data into the database.

As I mentioned, OL's database is really just a substitute file system. So 
I performed a second test where I parsed the dump file (again, using 
Java) and just copied each JSON object onto disk in a file structure 
that allowed fast retrieval (a trie index). This time, parsing and 
saving the file took about ... 4 days.
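
For what it's worth, here's a rough sketch of the kind of layout I mean, 
in Python since that's what you're working in. It assumes the 
tab-separated dump format with the key in the second column and the JSON 
blob in the last; the bucketing scheme is just an illustration, not 
exactly what I did. Retrieval is then just record_path() plus open(), no 
database involved.

import os

def record_path(root, key):
    # Map a key like '/books/OL24374M' to nested directories built from
    # successive characters of the identifier -- a crude trie on disk.
    ident = key.rsplit("/", 1)[-1]                      # 'OL24374M'
    buckets = [ident[i:i + 2] for i in range(0, 6, 2)]  # ['OL', '24', '37']
    return os.path.join(root, *buckets, ident + ".json")

def store(root, line):
    # Dump lines are tab-separated; the key is the second column and the
    # JSON record is the last.
    fields = line.rstrip("\n").split("\t")
    key, blob = fields[1], fields[-1]
    path = record_path(root, key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as out:
        out.write(blob)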

Recently, I was involved in a project where data records were downloaded 
from a federal web site, and then loaded into an Oracle database. Using 
sqlldr we were able to load 5 million records into a single table (no 
constraints) in about 2 hours (with indexing turned off, then 
indexing the entire table at the end).

I'm guessing that the database is not the bottleneck; parsing the dump 
file is.

The last time I did any research, a Python script runs about 45 times 
slower than a C program and about 30 times slower than a Java program 
(the math works out to Java being about 1.5 times slower than C). Python 
advocates argue that the performance difference doesn't outweigh the 
extra time it takes to write a Java program compared to a Python script 
(an argument I'm not going to get into here), but if execution speed is 
a concern, then a faster language is going to make a difference. Whether 
the difference is meaningful is for you to decide.

Here's my recommendation:

Rewrite your Python script to parse the dump file and write out a .sql 
file that can be bulk loaded into MySQL. Limit yourself to about 100,000 
records per .sql file. Time how long it takes to create each file 
(date/time stamp on each file is probably a good enough indicator). Only 
create enough files to test your performance.
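
Untested, but something along these lines; the dump file name, the table 
name ('editions'), and its columns are placeholders you'd adjust to your 
schema. Kill it (or add a break) after it has written a handful of 
files; that's enough for the timing test.

import json

DUMP = "ol_dump_editions_latest.txt"   # placeholder file name
PER_FILE = 100000                      # records per .sql file
PER_INSERT = 1000                      # rows per INSERT statement

def sql_quote(s):
    # Crude escaping, good enough for a timing test.
    return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'"

def flush(out, rows):
    if rows:
        out.write("INSERT INTO editions (ol_key, title, record_json) VALUES\n")
        out.write(",\n".join(rows) + ";\n")
    del rows[:]

file_no, rec_no, rows, out = 0, 0, [], None
with open(DUMP) as dump:
    for line in dump:
        if rec_no % PER_FILE == 0:
            if out:
                flush(out, rows)
                out.close()
            out = open("editions_%03d.sql" % file_no, "w")
            file_no += 1
        # Dump lines are tab-separated; the JSON record is the last column.
        blob = line.rstrip("\n").split("\t")[-1]
        rec = json.loads(blob)
        rows.append("(%s, %s, %s)" % (sql_quote(rec.get("key", "")),
                                      sql_quote(rec.get("title", "")),
                                      sql_quote(blob)))
        if len(rows) == PER_INSERT:
            flush(out, rows)
        rec_no += 1
if out:
    flush(out, rows)
    out.close()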

Use MySQL directly to bulk load a couple of the .sql files. Turn off 
indexing before the load, and reindex once the load is complete. Time 
how long the load takes. You now have a better idea of just where the 
bottleneck is; my hunch is that it's going to be in the parsing, not the 
database inserts.
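
Driving the load from a small script gives you the timings for free. 
Note that ALTER TABLE ... DISABLE KEYS only skips non-unique index 
maintenance on MyISAM tables; for InnoDB you'd drop the secondary 
indexes and recreate them after the load instead. The database name 
below is a placeholder, and I'm assuming credentials come from 
~/.my.cnf.

import glob
import subprocess
import time

MYSQL = ["mysql", "openlibrary"]   # placeholder db; credentials from ~/.my.cnf

def run_sql(sql):
    subprocess.run(MYSQL, input=sql.encode(), check=True)

run_sql("ALTER TABLE editions DISABLE KEYS;")       # MyISAM only
start = time.time()
for path in sorted(glob.glob("editions_*.sql")):
    t = time.time()
    with open(path, "rb") as f:
        subprocess.run(MYSQL, stdin=f, check=True)
    print("%s loaded in %.1f s" % (path, time.time() - t))
run_sql("ALTER TABLE editions ENABLE KEYS;")        # rebuild the indexes once
print("total: %.1f s" % (time.time() - start))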

If you can, rewrite the parsing portion of the experiment in a compiled 
(C) or quasi-compiled (Java) language. Try the new program against the 
dump and see how long it takes to produce intermediate files.

What you're trying to do is very disk intensive, so if you can put the 
dump file on one disk and the output file on a different disk, you may 
gain some performance. You could even do some poor man's parallel 
processing by parsing on one machine and writing the intermediate file 
to a shared directory on the database machine, then letting the database 
machine pick up and load the .sql files as they appear.
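
The pick-up side on the database machine can be as dumb as a polling 
loop; the shared directory path and the '.done' rename below are made up 
for illustration. Have the parser write to a temporary name and rename 
to *.sql only when the file is complete, so the loader never grabs a 
half-written file.

import glob
import os
import subprocess
import time

SHARED = "/mnt/shared/sql"          # placeholder shared directory
MYSQL = ["mysql", "openlibrary"]    # placeholder db; credentials from ~/.my.cnf

while True:
    for path in sorted(glob.glob(os.path.join(SHARED, "*.sql"))):
        with open(path, "rb") as f:
            subprocess.run(MYSQL, stdin=f, check=True)
        os.rename(path, path + ".done")   # mark as loaded
    time.sleep(30)                        # poll for new files from the parser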
