Eric, what happens if you access this from a non-HT institution? When I
go to HT I am often unable to download public domain titles because they
aren't available to members of the general public.
kc
On 5/26/15 8:30 AM, Eric Lease Morgan wrote:
In my copious spare time I have hacked together a thing I’m calling the
HathiTrust Research Center Workset Browser, a (fledgling) tool for doing
“distant reading” against corpora from the HathiTrust. [1]
The idea is to: 1) create, refine, or identify a HathiTrust Research Center
workset of interest — your corpus, 2) feed the workset’s rsync file to the
Browser, 3) have the Browser download, index, and analyze the corpus, and 4)
enable the reader to search, browse, and interact with the results of the
analysis. With varying success, I have done this with a number of worksets
on topics ranging from literature and philosophy to Rome and cookery. The best
working examples are the ones from Thoreau and Austen. [2, 3] The others are
still buggy.
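To make steps 2 and 3 concrete, here is a minimal sketch in Python of feeding a workset's rsync file to a downloader. The rsync host name and the local corpus directory are illustrative assumptions, not the Browser's actual configuration:

```python
# Sketch of steps 2 and 3: read the workset's rsync file list and fetch
# each item into a local corpus directory. RSYNC_HOST is a hypothetical
# placeholder, not the real HathiTrust endpoint.
import subprocess

RSYNC_HOST = "example.hathitrust.org"  # illustrative assumption

def build_rsync_command(remote_path, corpus_dir="./corpus"):
    """Construct the rsync invocation for one item in the workset."""
    return ["rsync", "-av", f"{RSYNC_HOST}::{remote_path}", corpus_dir]

def fetch_workset(rsync_file, corpus_dir="./corpus"):
    """Download every file listed in the workset's rsync file."""
    with open(rsync_file) as handle:
        for line in handle:
            path = line.strip()
            if path:
                subprocess.run(build_rsync_command(path, corpus_dir),
                               check=True)
```

Once the texts are on disk, the indexing and analysis scripts can be pointed at the corpus directory.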
As a further example, the Browser can/will create reports describing the corpus
as a whole. This analysis includes the size of a corpus measured in pages as
well as words, date ranges, word frequencies, and selected items of interest
based on pre-set “themes”: usage of color words, names of “great” authors, and
a set of timeless ideas. [4] This report is based on more fundamental reports
such as frequency tables, a “catalog”, and lists of unique words. [5, 6, 7, 8]
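For the curious, a frequency table and a theme report can be computed with nothing but the standard library. The color list below is an illustrative stand-in for the Browser's pre-set themes, not its actual word list:

```python
# Sketch of the frequency-table and theme reports. The COLORS set is an
# illustrative assumption standing in for the Browser's pre-set themes.
from collections import Counter
import re

COLORS = {"red", "blue", "green", "yellow", "black", "white"}

def frequencies(text):
    """Return a word-frequency table (a Counter) for a plain-text string."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def theme_usage(freq, theme):
    """Report how often each word in a theme appears in the corpus."""
    return {word: freq[word] for word in sorted(theme) if freq[word]}
```

A "dictionary" (the list of unique words) then falls out of the frequency table for free: it is just the table's keys.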
The whole thing is written in a combination of shell and Python scripts. It
should run on just about any out-of-the-box Linux or Macintosh computer. Take a
look at the code. [9] No special libraries needed. (“Famous last words.”) In
its current state, it is very Unix-y. Everything is done from the command line.
Lots of plain text files and the exploitation of STDIN and STDOUT. Like a
Renaissance cartoon, the Browser, in its current state, is only a sketch. Only
later will a more full-bodied, Web-based interface be created.
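In that Unix-y spirit, a typical filter in the pipeline might look like this hypothetical tokenizer (my sketch, not the Browser's actual script), which reads plain text on STDIN and writes one word per line to STDOUT:

```python
# Hypothetical Unix-style filter: tokenize STDIN, one word per line on
# STDOUT, so it can sit in a pipeline such as:
#   cat corpus/*.txt | python tokenize.py | sort | uniq -c | sort -rn
import sys
import re

def tokenize(stream):
    """Yield lowercase word tokens from an iterable of text lines."""
    for line in stream:
        for word in re.findall(r"[A-Za-z']+", line):
            yield word.lower()

if __name__ == "__main__":
    for word in tokenize(sys.stdin):
        print(word)
```

Because everything speaks plain text over STDIN and STDOUT, the pieces compose with standard tools like sort, uniq, and grep.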
The next steps are numerous and listed in no priority order: putting the whole
thing on GitHub, outputting the reports in generic formats so other things can
easily read them, improving the terminal-based search interface, implementing a
Web-based search interface, writing advanced programs in R that chart and graph
the analysis, providing a means for comparing & contrasting two or more items from a
corpus, indexing the corpus with a (real) indexer such as Solr, writing a
“cookbook” describing how to use the Browser to do “kewl” things, making the
metadata of corpora available as Linked Data, etc.
'Want to give it a try? For a limited period of time, go to the HathiTrust
Research Center Portal, create (refine or identify) a collection of personal
interest, use the Algorithms tool to export the collection's rsync file, and
send the file to me. I will feed the rsync file to the Browser, and then send
you the URL pointing to the results. [10] Let’s see what happens.
Fun with public domain content, text mining, and the definition of
librarianship.
Links
[1] HTRC Workset Browser - http://bit.ly/workset-browser
[2] Thoreau - http://bit.ly/browser-thoreau
[3] Austen - http://bit.ly/browser-austen
[4] Thoreau report - http://ntrda.me/1LD3xds
[5] Thoreau dictionary (frequency list) - http://bit.ly/thoreau-dictionary
[6] usage of color words in Thoreau - http://bit.ly/thoreau-colors
[7] unique words in the corpus - http://bit.ly/thoreau-unique
[8] Thoreau “catalog” - http://bit.ly/thoreau-catalog
[9] source code - http://ntrda.me/1Q8pPoI
[10] HathiTrust Research Center - https://sharc.hathitrust.org
—
Eric Lease Morgan, Librarian
University of Notre Dame
--
Karen Coyle
[email protected] http://kcoyle.net
m: +1-510-435-8234
skype: kcoylenet/+1-510-984-3600