On 9/1/2013 3:42 PM, Ben Companjen wrote:
> Hi all,
>
> I created a Python script that reads a dump file and puts the edition
> records in a MySQL database.
>
> It works (when you manually create the tables), but it's very slow:
> 10000 records in about an hour, which means all editions will take
> about 10 days of continuous operation.
>
> Does anybody have a faster way? Is there some script for this in the
> repository?
Back in March 2010 I did a similar exercise, with some differences. My database schema was fairly complex (tables for authors, author names, works, expressions, some intersection tables, etc.), I used PostgreSQL, and Java with JDBC. My intent was not to actually have a useful database, but 1. to refresh some Java and SQL skills, and 2. to prove that OL data could be put into a real database (right now, OL uses a database only as a kind of substitute file system). In my experiment I didn't turn off indexing before insertions, so I didn't get that optimization. Because of relational issues, I may have done two passes on the data dump, first parsing all of the author records before I parsed the edition records (I just don't recall). Running on an old Pentium III I had in the basement, it took about 4 days to load all the OL data into the database.

As I mentioned, OL's database is really just a substitute file system. So I performed a second test where I parsed the dump file (again, using Java) and just copied each JSON object onto disk in a file structure that allowed fast retrieval (a trie index; the first sketch at the end of this message gives the rough idea). This time, parsing and saving the files took about ... 4 days.

Recently, I was involved in a project where data records were downloaded from a federal web site and then loaded into an Oracle database. Using sqlldr we were able to load 5 million records into a single table (no constraints) in about 2 hours, with indexing turned off during the load and the entire table reindexed at the end.

I'm guessing that the database is not the bottleneck; parsing the dump file is. The last time I did any research, a Python script ran about 45 times slower than a C program and about 30 times slower than a Java program (the math indicates that Java is about 1.5 times slower than C). Python advocates argue that the time saved writing a Python script compared to a Java program makes up for the performance difference (an argument I'm not going to get into here), but if execution speed is a concern, then a faster language is going to make a difference. Whether the difference is meaningful is for you to decide.

Here's my recommendation:

1. Rewrite your Python script to parse the dump file and write out .sql files that can be bulk loaded into MySQL. Limit yourself to about 100,000 records per .sql file. (The second sketch at the end of this message shows roughly what I mean.)

2. Time how long it takes to create each file (the date/time stamp on each file is probably a good enough indicator). Only create enough files to test your performance.

3. Use MySQL directly to bulk load a couple of the .sql files. Turn off indexing before the load, and reindex once the load is complete. Time how long the load takes.

You now have a better idea of just where the bottleneck is; my hunch is that it's going to be in the parsing, not the database inserts. If you can, rewrite the parsing portion of the experiment in a compiled (C) or quasi-compiled (Java) language. Try the new program against the dump and see how long it takes to produce the intermediate files.

What you're trying to do is very disk intensive, so if you can put the dump file on one disk and the output files on a different disk, you may gain some performance. You could even do some poor man's parallel processing by parsing on one machine and writing the intermediate files to a shared directory on the database machine, then letting the database machine pick up and load the .sql files as they appear.
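First sketch: the trie-style file layout. My version was in Java and used OL's own key structure; the chunking scheme and record fields below are made up purely to illustrate the idea in Python (any record's path can be recomputed from its key, so you never scan a huge flat directory):

    # Illustrative only: spread JSON records across nested directories keyed
    # by chunks of the identifier, so a record is found by recomputing its
    # path rather than searching.
    import json
    import os

    def record_path(root, key):
        """Map a key like '/books/OL24354796M' to a nested file path."""
        ident = key.rsplit("/", 1)[-1]                 # e.g. 'OL24354796M'
        # Two-character chunks form the directory levels (a crude trie).
        chunks = [ident[i:i + 2] for i in range(0, min(len(ident), 6), 2)]
        return os.path.join(root, *chunks, ident + ".json")

    def save_record(root, record):
        path = record_path(root, record["key"])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(record, f)

    # Example:
    # save_record("/data/ol", {"key": "/books/OL24354796M", "title": "..."})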
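Second sketch: step 1 of the recommendation. This is a rough outline, not a drop-in script; I'm assuming a tab-separated dump with the JSON record in the last column, and a hypothetical table editions(ol_key, title, record_json) -- adjust to your actual dump format and schema, and note the escaping here is deliberately naive:

    # Rough sketch: turn an editions dump into chunked .sql files for bulk loading.
    import json

    RECORDS_PER_FILE = 100_000

    def sql_quote(value):
        """Very naive SQL string escaping, just for the sketch."""
        if value is None:
            return "NULL"
        return "'" + str(value).replace("\\", "\\\\").replace("'", "''") + "'"

    def write_chunk(rows, chunk_no):
        with open(f"editions_{chunk_no:04d}.sql", "w") as out:
            for row in rows:
                out.write(
                    "INSERT INTO editions (ol_key, title, record_json) VALUES "
                    f"({sql_quote(row['key'])}, {sql_quote(row.get('title'))}, "
                    f"{sql_quote(json.dumps(row))});\n"
                )

    def convert(dump_path):
        rows, chunk_no = [], 0
        with open(dump_path) as dump:
            for line in dump:
                record = json.loads(line.rstrip("\n").split("\t")[-1])
                rows.append(record)
                if len(rows) == RECORDS_PER_FILE:
                    write_chunk(rows, chunk_no)
                    rows, chunk_no = [], chunk_no + 1
        if rows:
            write_chunk(rows, chunk_no)

    # convert("ol_dump_editions_latest.txt")

You would then load a couple of the generated files from the mysql client (SOURCE editions_0000.sql; and so on), with indexing turned off beforehand and the table reindexed at the end, as described in step 3 above.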
