Is there anybody here who works for Internet Archive Scholar, or can somebody 
tell me how I might be able to download an archived file from the Wayback 
Machine?

A couple of weeks ago I learned about Internet Archive Scholar. [1] This is an 
index of scholarly content harvested from the 'Net. It is possible to query 
Scholar and get back JSON, and the JSON is full of cool and interesting 
bibliographic data. Here is a snippet of the JSON, and it describes the full 
text of an item:

  "fulltext": {
        "file_mimetype": "application/pdf",
        "access_type": "wayback",
        "file_sha1": "c3f8851bcae9fdfb4ee97d2b1960010ce8b3281d",
        "size_bytes": 1467536,
        "file_ident": "sddxyle4qzaz5lq3hexyw2h4my",
        "access_url": 
"https://web.archive.org/web/20190516143829/https://aibstudi.aib.it/article/download/11501/10805";,
        "release_ident": "tu5d2xp53jg3pmwa4lyjbtj45m",
        "thumbnail_url": 
"https://blobs.fatcat.wiki/thumbnail/pdf/c3/f8/c3f8851bcae9fdfb4ee97d2b1960010ce8b3281d.180px.jpg";
  },

I can parse the value of access_url to get a URL, but because of the nature of 
the 'Net, the URLs are broken about 33% of the time (antidotally speaking). 
Yes, I can use the full access_url, but this returns an HTML page with the 
something inside an iframe, I think. I want the actual thing, not a 
splash/landing/metadata page. 

Is there a way to programmatically reverse engineer the value of access_url 
(sans screen scraping) and get back a URL pointing to the item? 

By the way, the Internet Archive Scholar is pretty nifty. You can query the 
index, get back a bucket o' JSON, parse the JSON and pour it in a database, 
query the database, harvest the full text of items, and then send the result 
off to my Reader. This morning I used the query "Henry David Thoreau", 
downloaded almost 1,600 journal articles, and proceeded to "read" them. The 
whole process -- from beginning to end -- took about twenty minutes. There no 
way one can search for, download, and "read" 1,600 articles from a vended index.

Again, the value of access_url returns an HTML page, but what I really want is 
the thing in-and-of itself. Is there a way to do this?


[1] Internet Archive Scholar - https://scholar.archive.org/about

-- 
Eric Lease Morgan
Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
Hesburgh Libraries

University of Notre Dame
250E Hesburgh Library
Notre Dame, IN 46556
o: 574-631-8604
e: [email protected]
w: cds.library.nd.edu

Reply via email to