On Wed, December 8, 2010 9:06 am, Jeulin-L Michael wrote:
> Hi,
>
> Thank you very much for your answers!
> Actually Lee was right; I hadn't noticed it, but my search engine was not
> capable of dealing with the size of the dump (31 GB for the last one).
>
> For people who use this dump: are you reinjecting it into a proper
> database, or are you able to query the raw file efficiently?
I don't think the file on its own can reasonably be used directly. In OL's case, I believe they use a PostgreSQL database essentially as a file system. IIRC, the OL schema is a table whose columns are an internal ID, the OpenLibrary ID, the record type, and the record data (perhaps a modification date as well; I can't be sure). It's easy to get the blob of data associated with a single OpenLibrary ID, because that column of the table is indexed, but searching on the record data itself requires the support of external indexing routines.

For myself, I performed two experiments with the file.

In the first experiment, I wrote a Java program to parse the file and load it via JDBC into PostgreSQL as a proper WEMI database. Author records were loaded first, as other tables had foreign key constraints on them. "Work" records were loaded primarily into an 'expression' table, with small amounts of data placed in a 'work' table, and "Edition" records were loaded primarily into a 'manifestation' table. 'authors' were tied to 'expressions' and 'manifestations' by an 'event' table, which was also used to record personal information such as birth dates, death dates and flourished dates. Subject classifications had their own table, related to work records. If anyone would like it, I'd be happy to upload my schema.

On an older Linux box with standard 1200 rpm disk drives it took about 5 days to load all the data into the database. Interestingly, doing straight inserts, the database slowed perceptibly over time. I worked around this by adding a timing routine to my program: when insertion time reached a certain threshold, I paused, re-indexed and optimized the database, then continued. The B-tree indexes PostgreSQL uses degrade to almost a linear search when insertions are sequential; when this happens, they need to be rebalanced.
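As a rough sketch of that timing routine (hypothetical class and method names — my actual loader did the inserts over JDBC, which I've left out here), the idea is just to track a running average of batch insert times and signal when it has crept past a threshold:

```java
// Hypothetical sketch: track recent batch insert times and decide when the
// loader should pause to REINDEX/optimize before continuing.
public class InsertThrottle {
    private final double thresholdMillis;
    private double ewma = -1;  // exponentially weighted moving average

    public InsertThrottle(double thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /**
     * Record how long one insert batch took. Returns true when the
     * average batch time exceeds the threshold, i.e. when it's time to
     * pause, re-index and optimize the database, then resume loading.
     */
    public boolean recordBatch(double millis) {
        ewma = (ewma < 0) ? millis : 0.8 * ewma + 0.2 * millis;
        return ewma > thresholdMillis;
    }
}
```

When recordBatch() returns true, the loader would commit, issue REINDEX (and a VACUUM ANALYZE) over JDBC, then carry on with the next batch.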
For my second experiment, I figured that if OL was using a DBMS as a file system, maybe I could do the reverse and use the file system as the repository. I wrote another Java program to extract each record from the data dump and store it on my hard drive, with the file path derived from the OL record number. Thus "OL24506210M" was stored as "books/2/4/5/0/6/2/1/OL24506210M.json". Omitting the last digit means each leaf folder stores at most 10 files plus up to 10 subfolders, named "0" through "9".

OL is committed to storing historical data for every record. It accomplishes this by storing multiple copies of the same record each time it changes, differentiated by last-modified date/time (or perhaps by revision number; I don't recall right now). I accommodated this by adding revision numbers to the file names, so any given folder could contain more than 10 files, depending on the number of revisions of each record. "books/2/4/5/0/6/2/1/" might therefore contain the files "OL24506210M.1.json", "OL24506210M.2.json" and "OL24506210M.3.json" in addition to others.

Breaking the dump file down this way took about the same amount of time as loading a proper database (about 5 days), despite the fact that I was working on a faster machine and no database indexing was occurring. Searching for a specific record, however, was very fast, since the file system path could be derived from the record number. Essentially the file system was being used as a trie index: great for fast record retrieval, but still requiring external indexes for ad hoc queries.

If Anand could generate a tarball with this kind of directory structure, I think it could make the data much more usable to some (maybe all) people.

I hope this has been interesting, if nothing else.

Salut,
Lee

_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to [email protected]
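P.S. For anyone curious, the path derivation above is simple enough to sketch in a few lines of Java (class and method names here are hypothetical, not from my actual program): take the digits of the OL identifier, make a folder out of each one except the last, and append the full identifier as the file name.

```java
// Hypothetical sketch of the trie-style layout: each digit of the numeric
// part of the OL identifier, except the last, becomes a folder.
public class OlPath {
    /**
     * Derive a path such as "books/2/4/5/0/6/2/1/OL24506210M.json"
     * from an identifier like "OL24506210M".
     */
    public static String pathFor(String olId) {
        String digits = olId.replaceAll("\\D", "");  // e.g. "24506210"
        StringBuilder sb = new StringBuilder("books/");
        for (int i = 0; i < digits.length() - 1; i++) {
            sb.append(digits.charAt(i)).append('/');
        }
        return sb.append(olId).append(".json").toString();
    }
}
```

Because the path is a pure function of the record number, retrieval never needs an index lookup; that is what makes the layout behave like a trie.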
