The internetarchive [1] Python library that folks have already mentioned is pretty nice for working with IA collections.
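
As a rough sketch of what that can look like for the question below, assuming the library's `search_items` and `get_item` functions and the `bplsceep` collection identifier taken from the URL in the original message (the helper names `collection_identifiers` and `item_metadata` are made up for illustration, and this is not the code in fetch.py):

```python
def collection_identifiers(collection, search=None):
    """Yield the identifier of each item in an Internet Archive collection.

    `search` is a hypothetical hook for swapping in a different search
    function; by default the internetarchive library's search_items is used.
    """
    if search is None:
        # Imported lazily so the helper can be exercised without the library.
        from internetarchive import search_items
        search = search_items
    # Each search result is a dict that includes the item's identifier.
    for result in search(f"collection:{collection}"):
        yield result["identifier"]


def item_metadata(identifier):
    """Return the metadata for one item as a plain dict."""
    from internetarchive import get_item
    return dict(get_item(identifier).metadata)


# Example usage (needs network access and `pip install internetarchive`):
#
#     for identifier in collection_identifiers("bplsceep"):
#         print(identifier)
```

The identifiers come back as plain strings, so there is no HTML to scrape, and each one can be passed to `item_metadata` if the full record is wanted.
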
For a small project I needed to download the metadata for a collection created by the National Agricultural Library and write it out to the filesystem as JSON in a pairtree [2]. Perhaps it helps illustrate how to use internetarchive as a library as opposed to from the command line?

https://github.com/UMD-DCIC/seed-catalogs/blob/master/fetch.py

//Ed

[1] https://internetarchive.readthedocs.io/
[2] https://confluence.ucop.edu/display/Curation/PairTree

> On Sep 18, 2017, at 3:37 PM, Eric Lease Morgan <emor...@nd.edu> wrote:
>
> Is there an Internet Archive API that will allow me to get the contents of a
> collection as a stream of data and not as a stream of HTML?
>
> A cool collection of early English print materials is available at the
> following URL:
>
>   https://archive.org/details/bplsceep
>
> Each item is associated with an Internet Archive identifier. If I were able
> to easily extract these identifiers, then I would be more easily able to
> provide services based on the collection. But I’m lazy. I don’t want to read
> the HTML and scrape it accordingly. Ick! I’d rather be given the list of
> bibliographics in a more computer-friendly way.
>
> Again, can I programmatically read the contents of an Internet Archive
> collection?
>
> --
> Eric Morgan