Re: [CODE4LIB] download design

Eric Lease Morgan Sat, 25 Jul 2009 04:58:58 -0700

On Jul 24, 2009, at 9:43 PM, raj kumar wrote:

Over the next few years, I am tasked to download 30,000 archival masters from Internet Archive into an archive for long-term staff access that we may preserve with LOCKSS. These are masters of Montana state publications. I have a hierarchy in mind to receive these files. The hierarchy is state agency\year\title\pub_date\*.pdf.
I am intending to download the files in batches of 200 - 500 pdfs, but am thinking that if I slot them automatically into the archive hierarchy, misplaced or missing files could be very hard to find as the total grows. I will be logging the downloads, which should give me some control. Are there other strategies for ensuring that I can readily correct download errors? I am looking for recommendations for the simplest way to maintain reasonable control over the download process.
A couple things:

If you already have archive.org identifiers picked out, you can use
something like this python script to download them all from IA:
http://blog.openlibrary.org/2008/11/24/bulk-access-to-ocr-for-1-million-books/

'Sounds fun, and such a project is something I advocate not only for retrospective preservation purposes but for general collection building as well, but that is another story.

Without some sort of metadata it will not be possible for you to save your files in the hierarchy outlined above. State agency. Year. Title. Publication date. On the other hand, if metadata containing these values is readily accessible in the downloaded file itself or, as Ed mentioned, a part of some sort of manifest (or MARC record), then you are golden. I used Raj's script as a model for a similar process [1]:


  * write a cool query against Open Library returning identifiers
  * feed identifiers to mirroring program; I used wget
  * download file as well as metadata
  * parse metadata and process associated file accordingly
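
The last step might look something like the following. This is only a sketch, not the script from [1]: it assumes the metadata has already been parsed into a dictionary, and the key names (agency, year, title, pub_date), the "archive" root, and the helper names are all made up for illustration. It builds the destination path and refuses to guess when a value is missing, so an incomplete record gets noticed instead of misfiled.

```python
import re
from pathlib import PurePosixPath

# Hypothetical key names; substitute whatever the real metadata provides.
LEVELS = ("agency", "year", "title", "pub_date")

def slug(value):
    """Reduce a metadata value to a safe directory name."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", str(value).strip())
    return cleaned.strip("_") or "unknown"

def archive_path(meta, filename, root="archive"):
    """Build root/agency/year/title/pub_date/filename from a metadata dict."""
    missing = [key for key in LEVELS if not meta.get(key)]
    if missing:
        # Fail loudly rather than file the PDF in the wrong place.
        raise ValueError("metadata lacks: " + ", ".join(missing))
    parts = [slug(meta[key]) for key in LEVELS]
    return PurePosixPath(root, *parts, filename)

record = {
    "agency": "Fish, Wildlife & Parks",
    "year": 1998,
    "title": "Annual Report",
    "pub_date": "1998-06-30",
}
print(archive_path(record, "report.pdf"))
```

Records that raise an error can be set aside in a holding directory and fixed by hand later.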

If you're really lucky, then the "cool query" written against Open Library will also return the necessary metadata and you could use that as a guide to save your file.
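
That same list can double as a checklist after each batch. Here is a rough sketch, assuming you keep a plain list of the relative paths you expected to end up with; the function name and the example paths are hypothetical. It reports what is missing from disk and what is on disk but was never expected (a likely sign of a misfiled PDF).

```python
from pathlib import Path

def audit(expected, root):
    """Compare expected relative paths with the PDFs found under root.

    Returns (missing, unexpected), both as sorted lists of relative paths.
    """
    root = Path(root)
    wanted = {Path(p).as_posix() for p in expected}
    found = {p.relative_to(root).as_posix() for p in root.rglob("*.pdf")}
    return sorted(wanted - found), sorted(found - wanted)

if __name__ == "__main__":
    # Hypothetical manifest entries, for illustration only.
    manifest = [
        "DEQ/2001/Water_Report/2001-03-01/report.pdf",
        "DEQ/2002/Water_Report/2002-03-01/report.pdf",
    ]
    missing, unexpected = audit(manifest, "archive")
    print("missing:", missing)
    print("unexpected:", unexpected)
```

Run it after every batch of 200 - 500 files and the list of problems stays short enough to fix by hand.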


Good luck.

[1] similar process - 
http://infomotions.com/blog/2009/06/interent-archive-content-in-discovery-systems/

--
Eric Lease Morgan
