On Wed, December 8, 2010 9:06 am, Jeulin-L Michael wrote:
> Hi,
>
> Thank you very much for your answers!
> Actually Lee was right; I hadn't noticed it, but my search engine was not
> capable of dealing with the size of the dump (31 GB for the last one).
>
> For people who use this dump: are you reinjecting it into a proper
> database, or are you able to query the raw file efficiently?
I don't think the file on its own can reasonably be used directly. In OL's case, I believe they use a PostgreSQL database essentially as a file system. IIRC, the OL schema is a table whose columns are an internal ID, the OpenLibrary ID, the record type, and the record data (perhaps a modification date as well; I can't be sure). It's easy to get the blob of data associated with a single OpenLibrary ID, because that column of the table is indexed, but searching on the record data itself requires the support of external indexing routines.

For myself, I performed two experiments with the file.

In the first experiment, I wrote a Java program to parse the file and load it via JDBC into PostgreSQL as a proper WEMI database. Author records were loaded first, as other tables had foreign key constraints on them. "Work" records were loaded primarily into an 'expression' table, with small amounts of data placed in a 'work' table, and "Edition" records were loaded primarily into a 'manifestation' table. 'authors' were tied to 'expressions' and 'manifestations' by an 'event' table, which was also used to record personal information such as birth dates, death dates and flourished dates. Subject classifications had their own table, related to work records. If anyone would like it, I'd be happy to upload my schema.

On an older Linux box with standard 1200 rpm disk drives it took about 5 days to load all the data into the database. Interestingly, doing straight inserts, the database slowed perceptibly over time. I worked around this by adding a timing routine to my program: when insertion time reached a certain threshold, I paused, re-indexed and optimized the database, then continued. The B-tree indexes PostgreSQL uses degrade to almost a linear search when insertions are sequential; when this happens, they need to be rebalanced.
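As a rough sketch of that timing routine (hypothetical class and method names — my actual loader did the inserts over JDBC, which I've left out here), the idea is just to track a running average of batch insert times and signal when it has crept past a threshold:

```java
// Hypothetical sketch: track recent batch insert times and decide when the
// loader should pause to REINDEX/optimize before continuing.
public class InsertThrottle {
    private final double thresholdMillis;
    private double ewma = -1;  // exponentially weighted moving average

    public InsertThrottle(double thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /**
     * Record how long one insert batch took. Returns true when the
     * average batch time exceeds the threshold, i.e. when it's time to
     * pause, re-index and optimize the database, then resume loading.
     */
    public boolean recordBatch(double millis) {
        ewma = (ewma < 0) ? millis : 0.8 * ewma + 0.2 * millis;
        return ewma > thresholdMillis;
    }
}
```

When recordBatch() returns true, the loader would commit, issue REINDEX (and a VACUUM ANALYZE) over JDBC, then carry on with the next batch.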
For my second experiment, I figured that if OL was using a DBMS as a file system, maybe I could do the reverse and use the file system as the repository. I wrote another Java program to extract each record from the data dump and store it on my hard drive, with the file path derived from the OL record number. Thus "OL24506210M" was stored as "books/2/4/5/0/6/2/1/OL24506210M.json". Omitting the last digit means each leaf folder stores at most 10 files plus up to 10 subfolders, named "0" through "9".

OL is committed to storing historical data for every record. It accomplishes this by storing multiple copies of the same record each time it changes, differentiated by last-modified date/time (or perhaps by revision number; I don't recall right now). I accommodated this by adding revision numbers to the file names, so any given folder could contain more than 10 files, depending on the number of revisions of each record. "books/2/4/5/0/6/2/1/" might therefore contain the files "OL24506210M.1.json", "OL24506210M.2.json" and "OL24506210M.3.json" in addition to others.

Breaking the dump file down this way took about the same amount of time as loading a proper database (about 5 days), despite the fact that I was working on a faster machine and no database indexing was occurring. Searching for a specific record, however, was very fast, since the file system path could be derived from the record number. Essentially the file system was being used as a trie index: great for fast record retrieval, but still requiring external indexes for ad hoc queries.

If Anand could generate a tarball with this kind of directory structure, I think it could make the data much more usable to some (maybe all) people.

I hope this has been interesting, if nothing else.

Salut,
Lee

_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to [email protected]
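P.S. For anyone curious, the path derivation above is simple enough to sketch in a few lines of Java (class and method names here are hypothetical, not from my actual program): take the digits of the OL identifier, make a folder out of each one except the last, and append the full identifier as the file name.

```java
// Hypothetical sketch of the trie-style layout: each digit of the numeric
// part of the OL identifier, except the last, becomes a folder.
public class OlPath {
    /**
     * Derive a path such as "books/2/4/5/0/6/2/1/OL24506210M.json"
     * from an identifier like "OL24506210M".
     */
    public static String pathFor(String olId) {
        String digits = olId.replaceAll("\\D", "");  // e.g. "24506210"
        StringBuilder sb = new StringBuilder("books/");
        for (int i = 0; i < digits.length() - 1; i++) {
            sb.append(digits.charAt(i)).append('/');
        }
        return sb.append(olId).append(".json").toString();
    }
}
```

Because the path is a pure function of the record number, retrieval never needs an index lookup; that is what makes the layout behave like a trie.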
