Re: [CODE4LIB] eebo [perfect texts]

Eric Lease Morgan Mon, 08 Jun 2015 05:59:29 -0700

On Jun 8, 2015, at 7:32 AM, Owen Stephens <[email protected]> wrote:


> I’ve just seen another interesting take based (mainly) on data in the 
> TCP-EEBO release:
> 
>   
> https://scalablereading.northwestern.edu/2015/06/07/shakespeare-his-contemporaries-shc-released/
> 
> It includes mention of MorphAdorner[1] which does some clever stuff around 
> tagging parts of speech, spelling variations, lemmata etc. and another tool 
> which I hadn’t come across before, AnnoLex[2], “for the correction and 
> annotation of lexical data in Early Modern texts”.
> 
> This paper[3] from Alistair Baron and Andrew Hardie at the University of 
> Lancaster in the UK about preparing EEBO-TCP texts for corpus-based analysis 
> may also be of interest, and the team at Lancaster have developed a tool 
> called VARD which supports pre-processing texts[4].
> 
> [1] http://morphadorner.northwestern.edu
> [2] http://annolex.at.northwestern.edu
> [3] http://eprints.lancs.ac.uk/60272/1/Baron_Hardie.pdf
> [4] http://ucrel.lancs.ac.uk/vard/about/


All of this is really very interesting. Really. At the same time, there seems 
to be a WHOLE lot of effort spent on cleaning and normalizing data, and very 
little done to actually analyze it beyond “close reading”. The final goal of 
all these interfaces seems to be refined search. Frankly, I don’t need search. 
And the only community who will want this level of search will be the scholarly 
scholar. What about the undergraduate student? What about the just-more-than-casual
reader? What about the engineer? Most people don’t know how or why 
parts-of-speech are important let alone what a lemma is. Nor do they care. I 
can find plenty of things. I need (want) analysis. Let’s assume the data is 
clean (or rather, accept the fact that there is dirty data, akin to the dirty 
data created through OCR, and there is nothing a person can do about it) and 
let’s see some automated comparisons between texts. Examples might include:

  * this one is longer
  * this one is shorter
  * this one includes more action
  * this one discusses such & such theme more than this one
  * so & so theme came and went during a particular time period
  * the meaning of this phrase changed over time
  * the author’s message of this text is…
  * this given play asserts the following facts
  * here is a map illustrating where the protagonist went when
  * a summary of this text includes…
  * this work is fiction
  * this work is non-fiction
  * this work was probably influenced by…

We don’t need perfect texts before analysis can be done. Sure, perfect texts 
help, but they are not necessary. Observations and generalizations can be made 
even without perfectly transcribed texts.
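
As a rough illustration of the first few comparisons in the list above (longer, 
shorter, more on a given theme), here is a minimal Python sketch. The two sample 
passages, the theme word list, and the function names are all made up for 
illustration; nothing here comes from EEBO-TCP, MorphAdorner, or VARD. It does 
no spelling normalization at all, so variant spellings only count if they are 
listed by hand.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of ASCII letters as tokens.

    No spelling normalization is attempted; characters outside a-z
    (for example the long s) simply act as token boundaries.
    """
    return re.findall(r"[a-z]+", text.lower())


def theme_rate(tokens, theme_words):
    """Return occurrences of any theme word per 1,000 tokens."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in theme_words)
    return 1000.0 * hits / len(tokens)


def compare(name_a, text_a, name_b, text_b, theme_words):
    """Compare two texts by length and by how often a theme is mentioned."""
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    rate_a = theme_rate(tokens_a, theme_words)
    rate_b = theme_rate(tokens_b, theme_words)

    def larger(x, y):
        # Return the name of the text with the larger value, or None on a tie.
        if x == y:
            return None
        return name_a if x > y else name_b

    return {
        "lengths": {name_a: len(tokens_a), name_b: len(tokens_b)},
        "longer": larger(len(tokens_a), len(tokens_b)),
        "theme_rates": {name_a: round(rate_a, 1), name_b: round(rate_b, 1)},
        "more_on_theme": larger(rate_a, rate_b),
    }


# Hypothetical sample passages with unnormalized spellings (invented, not from any corpus).
TEXT_A = "The kinge went to warre and the warre was long and cruell"
TEXT_B = "Of loue and of gardens I sing, and of one battell"

# Hypothetical theme list; variant spellings are added by hand.
WAR_WORDS = {"war", "warre", "battle", "battell"}

if __name__ == "__main__":
    print(compare("A", TEXT_A, "B", TEXT_B, WAR_WORDS))
```

Even with the spellings left as they are, the sketch still reports that text A 
is longer and spends more of its words on the theme. Cleaner texts would 
sharpen the numbers, but the comparison itself does not wait on them.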

—
ELM
