Re: [CODE4LIB] internet archive api

Edward Summers Mon, 18 Sep 2017 14:08:29 -0700

The internetarchive [1] Python library that folks have already mentioned is 
pretty nice for working with IA collections.
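
As a rough sketch of the simplest case (listing the identifiers in a collection), something like the following should work. It assumes the library's search_items() function and the "collection:" query syntax; the "search" parameter is a hypothetical hook I added so the function can be tried without network access.

```python
def collection_identifiers(collection, search=None):
    """Yield the identifier of every item in an Internet Archive collection.

    `search` is a hypothetical hook for illustration: any callable that
    takes a query string and returns an iterable of dicts containing an
    "identifier" key. It defaults to internetarchive.search_items.
    """
    if search is None:
        # Imported lazily so the function can be exercised without the
        # internetarchive package installed (pip install internetarchive).
        from internetarchive import search_items
        search = search_items

    # Each search result is a dict; by default it carries the identifier.
    for result in search(f"collection:{collection}"):
        yield result["identifier"]


if __name__ == "__main__":
    # bplsceep is the collection mentioned in the original question.
    for identifier in collection_identifiers("bplsceep"):
        print(identifier)
```

Run as a script, this prints one identifier per line, which can then be fed to whatever does the per-item work.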


For a small project I needed to download the metadata for a collection created 
by the National Agricultural Library, and write it out to the filesystem as 
JSON in a pairtree [2]. Perhaps it helps illustrate how to use internetarchive 
as a library, as opposed to from the command line?

    https://github.com/UMD-DCIC/seed-catalogs/blob/master/fetch.py
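
For anyone who doesn't want to click through, here is a minimal sketch of the general idea. It is not the contents of fetch.py. The pairtree_path() and write_metadata() helpers and the metadata.json filename are hypothetical names of my own, and the path mapping is simplified: it only splits the identifier into two-character segments and skips the character escaping the Pairtree spec also calls for. The only internetarchive call assumed is get_item().

```python
import json
import sys
from pathlib import Path


def pairtree_path(root, identifier):
    """Map an identifier to a pairtree-style directory under root.

    Simplified: the identifier is split into two-character segments
    (a trailing odd character gets its own segment). The Pairtree spec
    also escapes special characters, which this sketch does not do.
    """
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return Path(root).joinpath(*pairs)


def write_metadata(root, identifier, metadata):
    """Write metadata as JSON into the identifier's directory; return the file path."""
    directory = pairtree_path(root, identifier)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "metadata.json"
    target.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return target


if __name__ == "__main__":
    # Requires the internetarchive package and network access.
    from internetarchive import get_item

    # Usage: python sketch.py <identifier> [output-directory]
    identifier = sys.argv[1]
    root = sys.argv[2] if len(sys.argv) > 2 else "data"
    item = get_item(identifier)
    # item_metadata is the full metadata record the library fetched for the item.
    print(write_metadata(root, item.identifier, item.item_metadata))
```

Splitting the identifier this way keeps any single directory from filling up with thousands of entries, which is the main reason to bother with a pairtree for a large collection.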

//Ed

[1] https://internetarchive.readthedocs.io/
[2] https://confluence.ucop.edu/display/Curation/PairTree

> On Sep 18, 2017, at 3:37 PM, Eric Lease Morgan <[email protected]> wrote:
> 
> Is there an Internet Archive API that will allow me to get the contents of a 
> collection as a stream of data and not as a stream of HTML?
> 
> A cool collection of early English print materials is available at the 
> following URL:
> 
>  https://archive.org/details/bplsceep
> 
> Each item is associated with an Internet Archive identifier. If I were able 
> to easily extract these identifiers, then I would be more easily able to 
> provide services based on the collection. But I’m lazy. I don’t want to read 
> the HTML and scrape it accordingly. Ick! I’d rather be given the list of 
> bibliographics in a more computer-friendly way.
> 
> Again, can I programmatically read the contents of an Internet Archive 
> collection?
> 
> —
> Eric Morgan
