On Jun 12, 2019, at 8:40 PM, Stuart A. Yeates <syea...@gmail.com> wrote:

>> The Distant Reader [0] harvests an arbitrary number of user-supplied files 
>> or links to files, transforms them into plain text files, and performs 
>> numerous natural language processes against them. The result is a large set 
>> of indexes that can be used to "read" the given corpus. I have made 
>> available the about pages of a number of such indexes:
>> 
>>  * Code4Lib Journal - http://dh.crc.nd.edu/tmp/code4lib-journal/about.html
>>     o 1,234,348 words; 303 documents
>>     o all articles from a journal named Code4Lib Journal
> 
> Taking a look at distant reader (which I don't believe I've looked at before):
> 
> (a) It would be great to sanity-check the corpus by running language
> identification on each of the files

Stuart, thank you for the feedback. As of right now, the Distant Reader is only 
designed to process English language materials. Since it relies on a Python 
module called spaCy to do the part-of-speech and named-entity extraction, I 
ought to be able to handle other languages, such as the Romance languages, 
without too much difficulty. [1]
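
As a rough illustration of the sanity check suggested in (a), here is a minimal 
sketch that guesses a file's language by counting stopwords. It is not how the 
Distant Reader works today; the stopword lists and the guess_language function 
are made up for this example, and a real check would more likely use a 
dedicated language identification library.

```python
import re
from collections import Counter

# Tiny, hand-picked stopword lists; illustrative only and far from complete.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"},
    "french": {"le", "la", "les", "et", "de", "des", "est", "que", "dans", "pour"},
    "spanish": {"el", "la", "los", "y", "de", "que", "es", "en", "por", "para"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    # Keep runs of letters only, so digits and punctuation are ignored.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    scores = {
        lang: sum(counts[word] for word in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No stopword hits at all means there is no evidence to guess from.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(guess_language("The cat is in the house and it is happy with the dog"))
```

Files whose guess is not "english" (or is None) could then be flagged for 
review before the natural language processing begins.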


> (b) There are a whole flotilla of technical identifiers that could
> usefully be extracted from the text files (DOIs, ISBNs, ISSNs, etc)

This is a fun idea, and I will investigate it further.
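
A first pass might look something like the sketch below, which pulls DOIs and 
ISSNs out of plain text with regular expressions. The patterns are deliberately 
loose and are my own assumptions, not a vetted standard, and the function names 
are hypothetical. ISSN candidates are filtered with the standard check digit 
calculation so that things like year ranges are not mistaken for identifiers. 
ISBNs could be handled along similar lines.

```python
import re

# Loose, illustrative patterns; assumptions rather than authoritative syntax.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
ISSN_RE = re.compile(r"\b\d{4}-\d{3}[\dXx]\b")


def valid_issn(issn):
    """Verify the ISSN check digit (weights 8 down to 2, modulus 11)."""
    digits = issn.replace("-", "").upper()
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected


def extract_identifiers(text):
    """Return candidate DOIs and checksum-valid ISSNs found in the text."""
    # Trailing punctuation is usually sentence residue, not part of the DOI.
    dois = [m.group(0).rstrip(".,;)") for m in DOI_RE.finditer(text)]
    issns = [m.group(0) for m in ISSN_RE.finditer(text) if valid_issn(m.group(0))]
    return {"doi": dois, "issn": issns}


if __name__ == "__main__":
    sample = "See doi:10.1000/xyz123. ISSN 1940-5758, published 2010-2019."
    print(extract_identifiers(sample))
```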


> (c) A little webification of the texts would go a long way

Hmmm... The plain text versions of the documents are necessary for the natural 
language processing, but instead of returning links to the plain text I could 
return links to the cached versions of the texts which are usually formatted in 
HTML or as PDF. Thus, a part of the reading process would be made easier.


> (d) Has thought been put into making them data archive-friendly?

I don't understand. In this case, what does "archive-friendly" mean?


For a good time, I created a new data set of 460 love stories (238 million 
words; 460 documents; 5.94 GB uncompressed):

  * about page - http://dh.crc.nd.edu/sandbox/reader/hackaton/love-stories/about.html
  * data set ("study carrel") - http://dh.crc.nd.edu/sandbox/reader/hackaton/love-stories.zip

Again, thank you for the feedback.


[0] Distant Reader - https://distantreader.org
[1] spaCy - https://spacy.io/models

-- 
Eric Lease Morgan
Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
Hesburgh Libraries

University of Notre Dame
250E Hesburgh Library
Notre Dame, IN 46556
o: 574-631-8604
e: emor...@nd.edu
w: cds.library.nd.edu
