On Wed, Oct 10, 2012 at 2:42 AM, Ben Companjen <[email protected]> wrote:

> Hi Dave,
>
> The dump file formats are partially explained on
> http://openlibrary.org/developers/dumps (i.e. the lines are tab
> separated, with records at the end), partially documented by record
> type on http://openlibrary.org/type and partially undocumented (there
> is an issue on GitHub for this:
> https://github.com/internetarchive/openlibrary/issues/100).
> In the past months I have created some statistics for each dump (not a
> complete dump with all revisions, just the most recent revisions). For
> September's datadump it's here: https://gist.github.com/3863520 (with
> a short readme). It doesn't give a full file structure but gives an
> impression of what you can find in the dumps next to the "ideal" record
> format defined by http://openlibrary.org/type.
>
> What are the three different counts in the statistics?

There's a lot of cruft in the database, so it might be easier for people to
comprehend if you trimmed the low-frequency noise items or sorted things by
frequency.
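
As a rough sketch of that kind of trimming (not the script behind the gist):
the function below counts how often each top-level key occurs across dump
lines, drops keys seen fewer than a cutoff number of times, and returns the
rest sorted by frequency. It assumes the record is the last tab-separated
column of each line and is a JSON object. The names key_frequencies and
min_count are made up for illustration.

```python
import json
from collections import Counter


def key_frequencies(lines, min_count=1):
    """Count top-level record keys across dump lines, most frequent first."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: the record is the last tab-separated column and is
        # a JSON object.
        record = json.loads(line.rsplit("\t", 1)[-1])
        counts.update(record.keys())
    # most_common() sorts by descending count; drop keys below the cutoff.
    return [(key, n) for key, n in counts.most_common() if n >= min_count]
```

Passing an open dump file as lines and raising min_count should hide most of
the one-off keys while keeping the common ones at the top of the list.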

Tom
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to 
[email protected]
