On Feb 22, 2011, at 9:02 AM, Cindy Harper wrote:

> It's not ironic - my post was musing inspired by your work.  I guess I wasn't 
> sure if I understood your results. You were looking at the overall POS usage 
> in the entire texts as a possible way of ranking the texts. I was wondering 
> about POS of particular search terms - those that could take on several 
> POS....


Initially I wanted to see if I could classify works based on their POS usage 
[1]. I was hoping to find lots of action verbs in one work and call it an action 
story. I was hoping to find lots of nouns in another story and call it... I 
don't know, something else. Instead, after rudimentary investigation, I 
discovered that all of the works I analyzed had the same relative percentage 
of nouns, pronouns, verbs, adverbs, adjectives, etc. Maybe such a thing is 
indicative of the English language.
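
For what it is worth, the comparison described above can be sketched in a few 
lines of Python. This is only an illustration, not the code behind [1]. It 
assumes each work has already been run through some POS tagger and reduced to 
a list of (word, tag) pairs; the function name and the tag labels are 
hypothetical.

```python
from collections import Counter

def pos_percentages(tagged_tokens):
    """Return {tag: percentage of all tokens} for a list of (word, tag) pairs."""
    # Count how many tokens carry each tag, ignoring the words themselves.
    counts = Counter(tag for _word, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100.0 * n / total for tag, n in counts.items()}

# Tiny hand-tagged sample; the tags are assumed labels, not real tagger output.
sample = [("I", "PRON"), ("walked", "VERB"), ("home", "NOUN"), ("slowly", "ADV")]
profile = pos_percentages(sample)
```

Computing one such profile per work and lining the profiles up side by side 
is one way to see whether the percentages really are the same.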

On the other hand, I did notice a difference in the use of particular pronouns 
between works. In Walden by Thoreau, a story about an individual living on the 
banks of a "pond", there was a lot of use of the word "I", but in a different 
story, where the author and his brother canoe down a river, the word "we" 
predominated. Similarly, three Jane Austen stories have many words like "she" 
and "her", whereas those words are less frequent in the works by Thoreau. While my 
analysis was trivial and thin, I think we might be able to classify some works 
by gender or speaking voice. 
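
A minimal sketch of the pronoun comparison, assuming nothing more than plain 
text and a crude regular-expression tokenizer. The pronoun list and the 
function names are my own choices for illustration, not the method used for 
the numbers above.

```python
import re
from collections import Counter

# Assumed pronoun set; extend it as needed.
PRONOUNS = ("i", "we", "she", "her", "he", "him")

def pronoun_rates(text, pronouns=PRONOUNS):
    """Return {pronoun: share of all words} for the given text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {p: (counts[p] / total if total else 0.0) for p in pronouns}

def dominant_pronoun(text, pronouns=PRONOUNS):
    """Return the pronoun with the highest rate (first listed wins ties)."""
    rates = pronoun_rates(text, pronouns)
    return max(rates, key=rates.get)

voice = dominant_pronoun("I went to the pond and I stayed. We left.")
```

Rates, rather than raw counts, are compared so that works of different 
lengths can be set against each other.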

Similar things may be possible with other parts-of-speech, like adjectives, 
specifically colors. For example, 214 of the 117,540 words in Walden (0.18%) are 
color words [2], but only 13 of 121,917 words in Pride and Prejudice (0.01%) are 
color words [3]. Despite the similar lengths of the works, Walden is roughly 17 
times more "colorful" than Pride. Interesting? This only raises other questions. Is 0.18% a 
high value or a low value? Is the relative use of colors similar within a 
particular author or not? Has the use of color changed over time, or is it 
indicative of genre? Does the use of specific colors actually denote mood?
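
The arithmetic behind those figures can be checked with a short sketch. The 
color list below is an assumption (the lists actually used are in [2] and 
[3]), and the function names are hypothetical. Computed from the raw counts 
instead of the rounded percentages, the Walden-to-Pride ratio comes out near 
17.

```python
import re

# Assumed color vocabulary, for illustration only.
COLOR_WORDS = {"red", "orange", "yellow", "green", "blue", "purple",
               "brown", "black", "white", "gray", "grey", "pink"}

def color_fraction(text, colors=COLOR_WORDS):
    """Return (color_count, total_words, percentage) for a text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    hits = sum(1 for w in words if w in colors)
    pct = 100.0 * hits / total if total else 0.0
    return hits, total, pct

def relative_rate(count_a, total_a, count_b, total_b):
    """How many times more frequent the words are in work A than in work B."""
    return (count_a / total_a) / (count_b / total_b)

# Counts quoted in this message: Walden versus Pride and Prejudice.
walden_vs_pride = relative_rate(214, 117540, 13, 121917)
```

Given the full text of a work as a string, color_fraction(text) returns the 
count, the total, and the percentage in one call.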

In the past, libraries did not have much full text with which to evaluate 
content. That is not true nowadays. It is now possible to literally 
count and measure a book's characteristics. Since this metadata is numeric in 
nature, it lends itself to visualization. (Think Karen C's presentation at 
Code4Lib.) And this whole thing is good fodder for search, discovery, and 
evaluation. Too much of our metadata is qualitative.


[1] forays into POS - http://bit.ly/aM2eZx
[2] color words in Walden - http://t.co/hlg5ibL
[3] color words in Pride - http://t.co/VflNf3n

-- 
Eric Lease Morgan
