I'm sorry, but this kind of talk drives me nuts. Performance analysis is, or should be, driven by facts and measurements, not unsubstantiated generalizations.
On Tue, Oct 8, 2013 at 4:55 PM, Lee Passey <[email protected]> wrote:
>
> In my experiment I didn't turn off indexing before insertions, so I
> didn't get that optimization. Because of relational issues, I may have
> done two passes on the data dump, first parsing all of the author
> records before I parsed the edition records (I just don't recall).
> Running on an old Pentium III I had in the basement, it took about 4
> days to load all the OL data into the database.
>

Not turning off indexing creates huge overhead, and a PIII is orders of magnitude slower than a modern computer.

> Recently, I was involved in a project where data records were downloaded
> from a federal web site, and then loaded into an Oracle database. Using
> sqlldr we were able to load 5 million records into a single table (no
> constraints) in about 2 hours (with indexing turned off, then indexing
> the entire table at the end).
>

That sounds like a reasonable result, but it's not at all comparable to the previous experiment.

> I'm guessing that the database is not the bottleneck, parsing the dump
> file is.
>

What value does guessing have?

> The last time I did any research, a Python script runs about 45 times
> slower than a C program and about 30 times slower than a Java program
> (the math indicates that Java is about 1.5 times slower than C).

Do you have a citation? Whoever said this didn't know anything about benchmarking or performance analysis. Relative performance is totally dependent on workload, so any time you hear someone say language X is N times faster than language Y without any qualification or context, you immediately know you are best off ignoring anything they say about performance.

Here are some actual measured times using the current (Sept. 30) OpenLibrary dump (6.9 GB compressed, 48.5 M records) on my laptop while I was doing other work:

  5 min 26 sec - decompress and count 48.5 M lines using Unix commands (zcat | wc -l)
  7 min 51 sec - decompress and count using Python (one core, vs. two in the Unix pipe scenario)
 29 min 22 sec - decompress, count all record types, filter to editions, deserialize/parse the JSON for editions into a Python dict

(A minimal sketch of that last measurement appears after this message.)

I'll argue that 5 1/2 minutes is an absolute, never-to-be-achieved lower bound for the parsing task. If <language of your choice> is twice as fast as Python, you could save 15 minutes. If it were infinitely faster, you could save an additional 10 minutes. I would be absolutely stunned if this half hour weren't dwarfed by the database load time (and would want to see the numbers).

Tom
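
For concreteness, here is a minimal Python 3 sketch of the kind of run behind the 29-minute figure: decompress the dump, tally record types, filter to editions, and parse each edition's JSON into a dict. It assumes the usual tab-separated dump layout (type, key, revision, last-modified, JSON blob in the last column); the file name is a placeholder, and the field layout should be checked against the actual dump before relying on it.

    # Sketch only: count record types in an OpenLibrary dump and parse
    # edition records. Assumes tab-separated lines with the record type
    # in the first column and the JSON record in the last column.
    import gzip
    import json
    import time
    from collections import Counter

    DUMP = "ol_dump_2013-09-30.txt.gz"  # placeholder path

    start = time.time()
    type_counts = Counter()
    editions_parsed = 0

    with gzip.open(DUMP, "rt", encoding="utf-8") as dump:
        for line in dump:
            fields = line.rstrip("\n").split("\t")
            rec_type = fields[0]
            type_counts[rec_type] += 1
            if rec_type == "/type/edition":
                edition = json.loads(fields[-1])  # Python dict per edition
                editions_parsed += 1

    elapsed = time.time() - start
    print(type_counts.most_common())
    print("editions parsed:", editions_parsed)
    print("elapsed: %.1f min" % (elapsed / 60))

The first timing in the list can be reproduced directly at the shell with zcat <dump file> | wc -l; comparing that figure against the script's elapsed time is what separates raw decompression cost from the Python/JSON overhead.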
