On Jul 24, 2009, at 9:43 PM, raj kumar wrote:
Over the next few years, I am tasked to download 30,000 archival
masters from Internet Archive into an archive for long-term staff
access that we may preserve with LOCKSS. These are masters of
Montana state publications. I have a hierarchy in mind to receive
these files. The hierarchy is state agency\year\title\pub_date\*.pdf.
I am intending to download the files in batches of 200 - 500 pdfs,
but am thinking that if I slot them automatically into the archive
hierarchy, misplaced or missing files could be very hard to find as
the total grows. I will be logging the downloads, which should give
me some control. Are there other strategies for ensuring that I can
readily correct download errors? I am looking for recommendations
for the simplest way to maintain reasonable control over the
download process.
A couple things:
If you already have archive.org identifiers picked out, you can use
something like this Python script to download them all from IA:
http://blog.openlibrary.org/2008/11/24/bulk-access-to-ocr-for-1-million-books/
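
For illustration only, here is a minimal sketch of a logged batch download.
It is not the script linked above. It assumes each master is published as
<identifier>.pdf under archive.org's /download/ path, which will not hold
for every item, and the names pdf_url, fetch_batch, staging, and
downloads.csv are all hypothetical.

```python
import csv
import urllib.request
from pathlib import Path


def pdf_url(identifier):
    # Assumption: the PDF is named after the item identifier.
    return "https://archive.org/download/{0}/{0}.pdf".format(identifier)


def fetch_batch(identifiers, staging="staging", log_path="downloads.csv",
                fetch=urllib.request.urlretrieve):
    """Download each identifier into a staging folder and log the outcome.

    Returns the identifiers that failed so they can be retried later.
    """
    Path(staging).mkdir(parents=True, exist_ok=True)
    failed = []
    with open(log_path, "a", newline="") as handle:
        log = csv.writer(handle)
        for identifier in identifiers:
            url = pdf_url(identifier)
            target = Path(staging, identifier + ".pdf")
            try:
                fetch(url, str(target))
                status = "ok"
            except OSError as error:  # urllib's errors derive from OSError
                status = "failed: {}".format(error)
                failed.append(identifier)
            # One row per attempt: identifier, URL tried, result.
            log.writerow([identifier, url, status])
    return failed
```

Downloading into a flat staging folder first, and filing into the hierarchy
only after the log says "ok", keeps each batch of 200 - 500 files easy to
check by hand before anything is slotted into the archive.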
'Sounds fun, and such a project is something I advocate not only for
retrospective preservation purposes but for general collection
building as well, but that is another story.
Without some sort of metadata it will not be possible for you to save
your files in the hierarchy outlined above. State agency. Year. Title.
Publication date. On the other hand, if metadata containing these
values is readily accessible in the downloaded file itself or, as Ed
mentioned, a part of some sort of manifest (or MARC record), then you
are golden. I used Raj's script as a model for a similar process [1]:
* write a cool query against Open Library returning identifiers
* feed identifiers to mirroring program; I used wget
* download file as well as metadata
* parse metadata and process associated file accordingly
If you're really lucky, then the "cool query" written against Open
Library will also return the necessary metadata and you could use that
as a guide to save your file.
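
The "parse metadata and process associated file accordingly" step might
look something like the sketch below. It is only a rough illustration, not
the process described in [1]. It assumes the metadata arrives as a small
XML record, and that its creator, title, and date elements can stand in for
state agency, title, and publication date; those field mappings and the
names safe, destination, and audit are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path


def safe(text, fallback="unknown"):
    """Reduce a metadata value to a filesystem-friendly path component."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", (text or "").strip())
    return cleaned.strip("_.") or fallback


def destination(meta_xml, identifier, root="archive"):
    """Map one record to root/agency/year/title/pub_date/identifier.pdf."""
    meta = ET.fromstring(meta_xml)
    agency = meta.findtext("creator")   # assumed to hold the state agency
    title = meta.findtext("title")
    pub_date = (meta.findtext("date") or "").strip()
    # Take the year from the date only when it starts with four digits.
    year = pub_date[:4] if pub_date[:4].isdigit() else ""
    return Path(root, safe(agency), safe(year), safe(title),
                safe(pub_date), identifier + ".pdf")


def audit(expected):
    """Return identifiers whose expected file is absent or empty.

    `expected` maps each identifier to the path destination() gave it.
    """
    return sorted(
        identifier
        for identifier, path in expected.items()
        if not path.is_file() or path.stat().st_size == 0
    )
```

Records with missing values end up under folders named "unknown", so
misfiled items collect in one obvious place instead of scattering through
the hierarchy, and running audit() after each batch lists exactly which
identifiers need to be downloaded again.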
Good luck.
[1] similar process -
http://infomotions.com/blog/2009/06/interent-archive-content-in-discovery-systems/
--
Eric Lease Morgan