[CODE4LIB] internet archive scholar or archived file from wayback machine

Eric Lease Morgan Mon, 22 Mar 2021 10:04:09 -0700

Is there anybody here who works for Internet Archive Scholar, or can somebody 
tell me how I might be able to download an archived file from the Wayback 
Machine?

A couple of weeks ago I learned about Internet Archive Scholar. [1] This is an
index of scholarly content harvested from the 'Net. It is possible to query
Scholar and get back JSON, and the JSON is full of cool and interesting
bibliographic data. Here is a snippet of the JSON, and it describes the full
text of an item:

  "fulltext": {
    "file_mimetype": "application/pdf",
    "access_type": "wayback",
    "file_sha1": "c3f8851bcae9fdfb4ee97d2b1960010ce8b3281d",
    "size_bytes": 1467536,
    "file_ident": "sddxyle4qzaz5lq3hexyw2h4my",
    "access_url": "https://web.archive.org/web/20190516143829/https://aibstudi.aib.it/article/download/11501/10805",
    "release_ident": "tu5d2xp53jg3pmwa4lyjbtj45m",
    "thumbnail_url": "https://blobs.fatcat.wiki/thumbnail/pdf/c3/f8/c3f8851bcae9fdfb4ee97d2b1960010ce8b3281d.180px.jpg"
  },

I can parse the value of access_url to get the original URL, but because of the
nature of the 'Net, those URLs are broken about 33% of the time (anecdotally
speaking). Yes, I can use the full access_url, but this returns an HTML page
with the item embedded in an iframe, I think. I want the actual thing, not a
splash/landing/metadata page.
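
For illustration, the parsing step described above might look like the following minimal sketch. It assumes access_url always has the shape shown in the snippet (web.archive.org/web/, a numeric timestamp, then the original URL); the function name split_wayback_url is hypothetical and not part of any Scholar or Wayback library.

```python
import re

# Assumed shape: https://web.archive.org/web/<timestamp>[<modifier>]/<original URL>
WAYBACK_PATTERN = re.compile(
    r"^https?://web\.archive\.org/web/(\d{4,14})(?:[a-z]{2}_)?/(.+)$"
)


def split_wayback_url(access_url):
    """Return (timestamp, original_url) parsed from a Wayback access_url.

    Hypothetical helper; raises ValueError when the URL does not match
    the assumed shape.
    """
    match = WAYBACK_PATTERN.match(access_url)
    if match is None:
        raise ValueError(f"not a recognised Wayback URL: {access_url}")
    timestamp, original_url = match.groups()
    return timestamp, original_url


if __name__ == "__main__":
    url = (
        "https://web.archive.org/web/20190516143829/"
        "https://aibstudi.aib.it/article/download/11501/10805"
    )
    print(split_wayback_url(url))
```

The second element of the returned pair is the live URL that tends to be broken; the first is the capture timestamp.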

Is there a way to programmatically reverse engineer the value of access_url
(sans screen scraping) and get back a URL pointing to the item?
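
One possibility, offered as a sketch rather than a confirmed answer: the Wayback Machine accepts an id_ suffix on the capture timestamp, which asks for the archived bytes as captured instead of the replay page. Whether this works for every Scholar record is an assumption. The helper names raw_wayback_url and fetch_and_verify are hypothetical, and the checksum comparison assumes file_sha1 describes the archived file.

```python
import hashlib
import re
import urllib.request

# Assumed shape: https://web.archive.org/web/<timestamp>[<modifier>]/<original URL>
_WAYBACK = re.compile(
    r"^(https?://web\.archive\.org/web/)(\d{4,14})(?:[a-z]{2}_)?(/.+)$"
)


def raw_wayback_url(access_url):
    """Rewrite access_url so the timestamp carries the id_ modifier.

    Returns None when access_url does not match the assumed shape.
    """
    match = _WAYBACK.match(access_url)
    if match is None:
        return None
    prefix, timestamp, rest = match.groups()
    return f"{prefix}{timestamp}id_{rest}"


def fetch_and_verify(access_url, expected_sha1, timeout=30):
    """Download the raw capture and compare its SHA-1 with file_sha1."""
    raw_url = raw_wayback_url(access_url)
    if raw_url is None:
        raise ValueError(f"not a recognised Wayback URL: {access_url}")
    with urllib.request.urlopen(raw_url, timeout=timeout) as response:
        data = response.read()
    if hashlib.sha1(data).hexdigest() != expected_sha1:
        raise ValueError("downloaded bytes do not match file_sha1")
    return data


if __name__ == "__main__":
    url = (
        "https://web.archive.org/web/20190516143829/"
        "https://aibstudi.aib.it/article/download/11501/10805"
    )
    print(raw_wayback_url(url))
```

Comparing against file_sha1 gives a cheap way to tell a real PDF from a landing page that slipped through.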

By the way, the Internet Archive Scholar is pretty nifty. You can query the
index, get back a bucket o' JSON, parse the JSON and pour it in a database,
query the database, harvest the full text of items, and then send the result
off to my Reader. This morning I used the query "Henry David Thoreau",
downloaded almost 1,600 journal articles, and proceeded to "read" them. The
whole process -- from beginning to end -- took about twenty minutes. There is no
way one can search for, download, and "read" 1,600 articles from a vended index.
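
The "parse the JSON and pour it in a database" step could be sketched roughly as below. This is an illustration under assumptions, not the actual workflow: it assumes each record is a dictionary carrying a "fulltext" object like the snippet above, and the table layout and the function names load_records and pdf_urls are hypothetical.

```python
import sqlite3

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS fulltext (
    release_ident TEXT PRIMARY KEY,
    file_sha1     TEXT,
    file_mimetype TEXT,
    size_bytes    INTEGER,
    access_url    TEXT
)
"""


def load_records(connection, records):
    """Insert the fulltext portion of each record; skip records without one."""
    connection.execute(CREATE_TABLE)
    rows = []
    for record in records:
        fulltext = record.get("fulltext")
        if not fulltext:
            continue
        rows.append((
            fulltext.get("release_ident"),
            fulltext.get("file_sha1"),
            fulltext.get("file_mimetype"),
            fulltext.get("size_bytes"),
            fulltext.get("access_url"),
        ))
    connection.executemany(
        "INSERT OR REPLACE INTO fulltext VALUES (?, ?, ?, ?, ?)", rows
    )
    connection.commit()
    return len(rows)


def pdf_urls(connection):
    """Return the access_url of every PDF in the table, in a stable order."""
    cursor = connection.execute(
        "SELECT access_url FROM fulltext "
        "WHERE file_mimetype = 'application/pdf' ORDER BY release_ident"
    )
    return [row[0] for row in cursor]
```

A harvesting script could then loop over pdf_urls() and download each item in turn.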

Again, the value of access_url returns an HTML page, but what I really want is
the thing in-and-of itself. Is there a way to do this?

[1] Internet Archive Scholar - https://scholar.archive.org/about

--
Eric Lease Morgan
Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
Hesburgh Libraries

University of Notre Dame
250E Hesburgh Library
Notre Dame, IN 46556
o: 574-631-8604
e: [email protected]
w: cds.library.nd.edu
