I'm sorry, but this kind of talk drives me nuts.  Performance analysis is,
or should be, driven by facts and measurements, not unsubstantiated
generalizations.

On Tue, Oct 8, 2013 at 4:55 PM, Lee Passey <[email protected]> wrote:

>
> In my experiment I didn't turn off indexing before insertions, so I
> didn't get that optimization. Because of relational issues, I may have
> done two passes on the data dump, first parsing all of the author
> records before I parsed the edition records (I just don't recall).
> Running on an old Pentium III I had in the basement, it took about 4
> days to load all the OL data into the database.
>

Not turning off indexing creates huge overhead, and a PIII is orders of
magnitude slower than a modern computer.
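
To make that concrete, here's a minimal sketch of the load-then-index
pattern.  SQLite, the table layout, and the dummy rows are placeholders I
made up for illustration (neither experiment used SQLite); the point is
just that the index gets built once at the end instead of being maintained
on every insert:

    import sqlite3

    conn = sqlite3.connect("ol_test.db")
    conn.execute("CREATE TABLE editions (key TEXT, data TEXT)")

    # Bulk insert with no index in place (one big transaction).
    rows = (("/books/OL%dM" % i, "{}") for i in range(1000000))
    with conn:
        conn.executemany("INSERT INTO editions VALUES (?, ?)", rows)

    # Build the index once, after the data is loaded.
    conn.execute("CREATE INDEX idx_editions_key ON editions (key)")
    conn.close()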


> Recently, I was involved in a project where data records were downloaded
> from a federal web site, and then loaded into an Oracle database. Using
> sqlldr we were able to load 5 million records into a single table (no
> constraints) in about 2 hours (with indexing turned off, then indexing
> the entire table at the end).
>

That sounds like a reasonable result, but it's not at all comparable to the
previous experiment.


> I'm guessing that the database is not the bottleneck; parsing the dump
> file is.
>

What value does guessing have?


> The last time I did any research, a Python script ran about 45 times
> slower than a C program and about 30 times slower than a Java program
> (the math indicates that Java is about 1.5 times slower than C).


Do you have a citation?  Whoever said this didn't know anything about
benchmarking or performance analysis.  Relative performance depends
entirely on the workload, so whenever you hear someone claim that language
X is N times faster than language Y without any qualification or context,
you can safely ignore anything else they say about performance.

Here are some actual measured times using the current (Sept. 30)
OpenLibrary dump (6.9 GB compressed, 48.5 M records) on my laptop while I
was doing other work:

 5 min 26 sec - Decompress and count 48.5M lines using Unix commands (zcat | wc -l)
 7 min 51 sec - Decompress and count using Python (one core vs. two in the Unix pipe scenario)
29 min 22 sec - Decompress, count all record types, filter to editions, and deserialize/parse the JSON for each edition into a Python dict (rough sketch below)
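
For reference, a simplified sketch of the kind of script behind the third
number (not the exact one I ran; the filename is made up and I'm assuming
the usual dump layout of tab-separated columns -- type, key, revision,
last_modified, JSON -- one record per line):

    import gzip
    import json
    from collections import Counter

    DUMP = "ol_dump_2013-09-30.txt.gz"   # made-up filename

    type_counts = Counter()
    editions = 0

    with gzip.open(DUMP, "rt", encoding="utf-8") as dump:
        for line in dump:
            rec_type = line.split("\t", 1)[0]
            type_counts[rec_type] += 1
            if rec_type == "/type/edition":
                record = json.loads(line.rsplit("\t", 1)[1])  # JSON -> dict
                editions += 1

    print(type_counts.most_common())
    print("editions parsed:", editions)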

I'll argue that 5 1/2 minutes is an absolute, never-to-be-achieved lower
bound for the parsing task.  If <language of your choice> were twice as
fast as Python, you could save 15 minutes.  If it were infinitely faster,
you could save an additional 10 minutes.
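
Spelling out that back-of-envelope arithmetic (times in minutes):

    total = 29 + 22 / 60.0      # full Python pipeline, ~29.4
    floor = 5 + 26 / 60.0       # zcat | wc -l lower bound, ~5.4

    print(total / 2)            # ~14.7 saved if the parsing were twice as fast
    print(total / 2 - floor)    # ~9.3 more saved if it were infinitely fast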

I would be absolutely stunned if this half hour weren't dwarfed by the
database load time (and would want to see the numbers).

Tom