On Wed, Oct 10, 2012 at 2:42 AM, Ben Companjen <[email protected]> wrote:
> Hi Dave,
>
> The dump file formats are partially explained on
> http://openlibrary.org/developers/dumps (i.e. the lines are tab
> separated, with records at the end), partially documented by record
> type on http://openlibrary.org/type and partially undocumented (there
> is an issue on GitHub for this:
> https://github.com/internetarchive/openlibrary/issues/100).
> In the past months I have created some statistics for each dump (not a
> complete dump with all revisions, just the most recent revisions). For
> September's datadump it's here: https://gist.github.com/3863520 (with
> a short readme). It doesn't give a full file structure but gives an
> impression what you can find in the dumps next to the "ideal" record
> format defined by http://openlibrary.org/type.

What are the three different counts in the statistics?

There's a lot of cruft in the database, so it might be easier for people
to comprehend if you trimmed the low frequency noise items or sorted
things by frequency.

Tom
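
To make the "sorted by frequency" suggestion concrete, here is a rough sketch of how such a summary could be produced from a dump. It assumes each line is tab separated with the record type in the first column and the record itself at the end; the column position, the function name and the `min_count` cutoff are assumptions for illustration, not something confirmed in this thread.

```python
from collections import Counter


def count_record_types(lines, min_count=1):
    """Return (record_type, count) pairs, most frequent first.

    Assumption: every line is tab separated and the first column holds
    the record type (e.g. "/type/edition"). Types seen fewer than
    min_count times are dropped as low frequency noise.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Only the first column is needed, so split once.
        record_type = line.split("\t", 1)[0]
        counts[record_type] += 1
    # most_common() sorts by descending count.
    return [(t, n) for t, n in counts.most_common() if n >= min_count]
```

Passing an open file object for a decompressed dump as `lines` and raising `min_count` would hide the rarely occurring types; the right cutoff depends on the dump.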
