On Jul 24, 2009, at 2:20 PM, [Chris Stockwell] wrote:
Over the next few years, I am tasked with downloading 30,000 archival masters from the Internet Archive into an archive for long-term staff access that we may preserve with LOCKSS. These are masters of Montana state publications. I have a hierarchy in mind to receive these files: state agency\year\title\pub_date\*.pdf.

I intend to download the files in batches of 200-500 PDFs, but I am thinking that if I slot them automatically into the archive hierarchy, misplaced or missing files could be very hard to find as the total grows. I will be logging the downloads, which should give me some control. Are there other strategies for ensuring that I can readily correct download errors? I am looking for recommendations for the simplest way to maintain reasonable control over the download process.
A couple of things:

If you already have archive.org identifiers picked out, you can use something like this Python script to download them all from IA:
http://blog.openlibrary.org/2008/11/24/bulk-access-to-ocr-for-1-million-books/
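If that script doesn't fit your setup, a minimal sketch of the same idea is below — loop over identifiers and fetch each file through an archive.org /download/ URL. The function names and the per-item file names are my own illustration, not part of the linked script:

```python
import os
import urllib.request

def download_url(identifier, filename):
    """Build the archive.org download URL for one file in an item."""
    return "http://www.archive.org/download/%s/%s" % (identifier, filename)

def fetch_item_file(identifier, filename, dest_dir):
    """Download one file from an item into dest_dir; return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, filename)
    urllib.request.urlretrieve(download_url(identifier, filename), local_path)
    return local_path

# Hypothetical batch run over a plain-text list of identifiers:
# for line in open("identifiers.txt"):
#     identifier = line.strip()
#     fetch_item_file(identifier, identifier + ".pdf", "masters")
```

Logging each (identifier, local path) pair as you go gives you the audit trail you mentioned.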
You can use the archive.org advanced search engine to produce an XML, JSON, or CSV file with all identifiers for a particular contributing institution:
http://www.archive.org/advancedsearch.php

e.g. all identifiers for the Montana State Library (http://www.archive.org/details/MontanaStateLibrary) as an XML file (change rows=10 to rows=10000 to get them all):
http://www.archive.org/advancedsearch.php?q=collection%3A%22montanastatelib%22&fl%5B%5D=creator&fl%5B%5D=identifier&fl%5B%5D=title&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=10&fmt=xml&xmlsearch=Search
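Rather than hand-editing that long URL, you can build it programmatically. A small sketch (the function name is my own; the parameters mirror the query above):

```python
import urllib.parse

def advancedsearch_url(collection, rows=10000, fmt="xml"):
    """Build an archive.org advanced-search URL that lists every
    identifier in the given collection."""
    params = {
        "q": 'collection:"%s"' % collection,
        "fl[]": "identifier",
        "rows": rows,
        "fmt": fmt,
        "xmlsearch": "Search",
    }
    return ("http://www.archive.org/advancedsearch.php?"
            + urllib.parse.urlencode(params))
```

For example, advancedsearch_url("montanastatelib") yields a URL equivalent to the one above, already asking for all rows.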
Also, if you have an archive.org identifier, you can get the files.xml that contains MD5 and SHA1 hashes, so you can verify your download. To pull the files.xml, use a /download/id/id_files.xml URL, e.g.:
http://www.archive.org/download/librariesoffutur00lickuoft/librariesoffutur00lickuoft_files.xml
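A sketch of the verification step: parse the files.xml into a name-to-MD5 map, then compare against hashes computed locally. This assumes the usual files.xml shape, with <file name="..."> elements each containing an <md5> child:

```python
import hashlib
import xml.etree.ElementTree as ET

def parse_files_xml(xml_text):
    """Map each file name listed in a files.xml document to its stated MD5."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): f.findtext("md5") for f in root.findall("file")}

def md5_of(path):
    """Compute the MD5 of a local file, for comparison against files.xml."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical check after a batch download:
# expected = parse_files_xml(open("id_files.xml").read())
# for name, md5 in expected.items():
#     if md5_of(os.path.join("masters", name)) != md5:
#         print("MISMATCH:", name)
```

Any mismatch flags a file to re-download, which directly addresses the "readily correct download errors" question.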
-raj