Hi Christian,

Great questions. I'd be interested to see the results of your
integration once it's in place.

Dumps are created automatically at the beginning of each month. You
could GET http://openlibrary.org/data/ol_dump_latest.txt.gz on the 2nd
day of each month and be redirected to the new file, or check the RSS
feed http://archive.org/services/collection-rss.php?collection=ol_exports
(you'd have to selectively download the ol_dump_* files, not the
ol_cdump_* ones, which contain all versions). You can be reasonably sure
that the dump changes every month, because there are 25k+ edits every
30 days, according to the statistics on the homepage.
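If it helps, here is a rough sketch (Python, using the requests
library) of how a robot could poll that URL. The redirect target and
its filename pattern are assumptions on my part, so please verify
against the live server:

import os
import requests

LATEST = "http://openlibrary.org/data/ol_dump_latest.txt.gz"

# Follow the redirect to see which monthly file "latest" currently points at.
resp = requests.head(LATEST, allow_redirects=True)
final_url = resp.url  # e.g. .../ol_dump_<date>.txt.gz (assumed naming pattern)
filename = final_url.rsplit("/", 1)[-1]

# The filename doubles as a version stamp: only download if it's new to us.
if not os.path.exists(filename):
    r = requests.get(final_url, stream=True)
    r.raise_for_status()
    with open(filename, "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)

The same idea works with the RSS feed: parse the item links and keep
only the ol_dump_* entries.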

Since all records are versioned and dated, producing incremental
updates is possible. Even though ResourceSync is newer than OAI-PMH, it
builds on the sitemap protocol, and at one point there were sitemaps
available for Open Library, so it could be the easier of the two to
implement. I believe I remember seeing a commit on GitHub with the
comment "remove sitemaps", but I'm not so sure anymore. Perhaps the
sitemaps code is still around.
In the meantime, have you looked at the Recent Changes API?
http://openlibrary.org/dev/docs/api/recentchanges
You can request daily updates using this API.
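Something along these lines (again Python + requests; the exact URL
form and response fields are from the docs page above and from memory,
so treat them as assumptions) would pull yesterday's changes:

import requests
from datetime import date, timedelta

day = date.today() - timedelta(days=1)
url = "http://openlibrary.org/recentchanges/%04d/%02d/%02d.json" % (
    day.year, day.month, day.day)

# Each entry describes one change set: its kind plus the affected
# record keys (field names assumed from the docs).
for change in requests.get(url).json():
    keys = [c["key"] for c in change.get("changes", [])]
    print(change.get("kind"), keys)

You could then fetch the changed records individually via the normal
API instead of waiting for the next full dump.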

Real-time updates via PubSub (Publish-Subscribe) would be really cool,
but there are enough other things to fix/improve first:
https://github.com/internetarchive/openlibrary/issues?page=1&state=open
(in case you have spare developer time ;))
Feel free to add an issue for OAI-PMH / ResourceSync, but don't expect
immediate action. :)

Regards,

Ben


On 25 July 2013 09:12, Christian Tønsberg <[email protected]> wrote:
> Hi OpenLibrary Tech,
>
> Great effort!! Thank you!
>
> At DTU Library, we would very much like to include OL data into our search
> engine - to enrich the search experience for our users, and to drive (some)
> traffic for books back to the OL website.
>
> We discovered http://openlibrary.org/developers/dumps ; it looks great!
>
> To include that kind of bibliographic data available online into our search
> engine, we have developed robots/agents which are responsible for
> discovering whether new data are available from our various sources (and if
> so, fetching the new material and ingesting it into our processing pipeline,
> eventually leading to the indexing of that material).
>
> Announcing links to compressed dump files on a website (like
> http://openlibrary.org/developers/dumps) is patrollable by our robots, but
> rather inconvenient and error-prone.
>
> Do you have any plans to expose these dump files through other protocols
> (FTP, HTTP directory listings, etc.) more suited for robots/agents? And
> perhaps accompanied by digests (e.g. MD5) of the files, so robots can easily
> detect whether there is in fact new/changed material to download?
>
> Any plans to supplement compressed full dumps by using protocols like
> OAI-PMH (http://www.openarchives.org/OAI/openarchivesprotocol.html) or
> ResourceSync (http://www.openarchives.org/rs/0.9/toc) to minimize the effort
> for clients when retrieving incremental updates?
>
> Again: Great effort!  And thanks in advance!
>
> Cheers,
>   Christian Tønsberg,
>   Manager IT systems
>   DTU Library
>
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to 
[email protected]