On Jul 24, 2009, at 2:20 PM, [Chris Stockwell] wrote:
Over the next few years, I am tasked with downloading 30,000 archival masters from the Internet Archive into an archive for long-term staff access that we may preserve with LOCKSS. These are masters of Montana state publications. I have a hierarchy in mind to receive these files: state agency\year\title\pub_date\*.pdf.

I intend to download the files in batches of 200-500 PDFs, but I am thinking that if I slot them automatically into the archive hierarchy, misplaced or missing files could be very hard to find as the total grows. I will be logging the downloads, which should give me some control. Are there other strategies for ensuring that I can readily correct download errors? I am looking for recommendations for the simplest way to maintain reasonable control over the download process.
A couple of things:

If you already have archive.org identifiers picked out, you can use something like this Python script to download them all from IA:
http://blog.openlibrary.org/2008/11/24/bulk-access-to-ocr-for-1-million-books/
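If that script doesn't fit your setup, a minimal sketch of the same idea is below — loop over identifiers and fetch each file through an archive.org /download/ URL. The function names and the per-item file names are my own illustration, not part of the linked script:

```python
import os
import urllib.request

def download_url(identifier, filename):
    """Build the archive.org download URL for one file in an item."""
    return "http://www.archive.org/download/%s/%s" % (identifier, filename)

def fetch_item_file(identifier, filename, dest_dir):
    """Download one file from an item into dest_dir; return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, filename)
    urllib.request.urlretrieve(download_url(identifier, filename), local_path)
    return local_path

# Hypothetical batch run over a plain-text list of identifiers:
# for line in open("identifiers.txt"):
#     identifier = line.strip()
#     fetch_item_file(identifier, identifier + ".pdf", "masters")
```

Logging each (identifier, local path) pair as you go gives you the audit trail you mentioned.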
You can use the archive.org advanced search engine to produce an XML, JSON, or CSV file with all identifiers for a particular contributing institution:
http://www.archive.org/advancedsearch.php

e.g. all identifiers for the Montana State Library (http://www.archive.org/details/MontanaStateLibrary) as an XML file (change rows=10 to rows=10000 to get them all):
http://www.archive.org/advancedsearch.php?q=collection%3A%22montanastatelib%22&fl%5B%5D=creator&fl%5B%5D=identifier&fl%5B%5D=title&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=10&fmt=xml&xmlsearch=Search
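Rather than hand-editing that long URL, you can build it programmatically. A small sketch (the function name is my own; the parameters mirror the query above):

```python
import urllib.parse

def advancedsearch_url(collection, rows=10000, fmt="xml"):
    """Build an archive.org advanced-search URL that lists every
    identifier in the given collection."""
    params = {
        "q": 'collection:"%s"' % collection,
        "fl[]": "identifier",
        "rows": rows,
        "fmt": fmt,
        "xmlsearch": "Search",
    }
    return ("http://www.archive.org/advancedsearch.php?"
            + urllib.parse.urlencode(params))
```

For example, advancedsearch_url("montanastatelib") yields a URL equivalent to the one above, already asking for all rows.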
Also, if you have an archive.org identifier, you can get the files.xml that contains MD5 and SHA1 hashes, so you can verify your download. To pull the files.xml, use a /download/id/id_files.xml URL, e.g.:
http://www.archive.org/download/librariesoffutur00lickuoft/librariesoffutur00lickuoft_files.xml
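A sketch of the verification step: parse the files.xml into a name-to-MD5 map, then compare against hashes computed locally. This assumes the usual files.xml shape, with <file name="..."> elements each containing an <md5> child:

```python
import hashlib
import xml.etree.ElementTree as ET

def parse_files_xml(xml_text):
    """Map each file name listed in a files.xml document to its stated MD5."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): f.findtext("md5") for f in root.findall("file")}

def md5_of(path):
    """Compute the MD5 of a local file, for comparison against files.xml."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical check after a batch download:
# expected = parse_files_xml(open("id_files.xml").read())
# for name, md5 in expected.items():
#     if md5_of(os.path.join("masters", name)) != md5:
#         print("MISMATCH:", name)
```

Any mismatch flags a file to re-download, which directly addresses the "readily correct download errors" question.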
-raj