On Jul 1, 2014, at 9:12 AM, Katie <[email protected]> wrote:

> Has anyone here experience in the world of natural language programming 
> (while applying information retrieval techniques)? 
> 
> I'm currently trying to develop a tool that will:
> 
>   1. take a pdf and extract the text (paying no attention to images or 
> formatting)
>   2. analyze the text via term weighting, inverse document frequency, and 
> other natural language processing techniques
>   3. assemble a list of suggested terms and concepts that are weighted 
> heavily in that document
> 
> Step 1 is straightforward and I've had much success there. Step 2 is the 
> problem child. I've played around with a few APIs (like AlchemyAPI) but they 
> have character length limitations or other shortcomings that keep me looking. 
> 
> The background behind this project is that I work for a digital library with 
> a large pre-existing collection of pdfs with rudimentary metadata. The 
> aforementioned tool will be used to classify and group the pdfs according to 
> the themes of the library. Our CMS is Drupal so depending on my level of 
> ambition, this *might* develop into a module.  
> 
> Does this sound like a project that has been done/attempted before? Any 
> suggested tools or reading materials?


You have, more or less, just described my job. Increasingly, I:

  * am given or create a list of citations
  * save the citations as a computer-readable list (database)
  * harvest the full text of each cited item
  * extract the plain text from the harvested PDF file (see the
    sketch after this list)
  * clean up / post-process the text, maybe
  * do analysis against individual texts or the entire corpus
  * provide interfaces to “read” the corpus from “a distance”
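
For what it is worth, the harvest-and-extract steps can be done in a few
lines of Python, assuming the pdfminer.six library and a directory of
harvested PDF files (both assumptions on my part); the command-line
pdftotext from the Poppler suite works equally well:

  from pdfminer.high_level import extract_text   # pip install pdfminer.six
  import glob
  import os

  # loop through a (hypothetical) directory of harvested PDF files
  for pdf in glob.glob('corpus/*.pdf'):
      text = extract_text(pdf)          # keeps the words; ignores the images
      plain = os.path.splitext(pdf)[0] + '.txt'
      with open(plain, 'w', encoding='utf-8') as handle:
          handle.write(text)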

The analysis is akin to descriptive statistics but for “bags of words”. I
create lists of both frequently used as well as statistically significant
words/phrases. I do parts-of-speech (POS) analysis and create lists of nouns,
verbs, adjectives, etc. I then create more lists of the frequently used and
significant POS. I sometimes do sentiment analysis (alternatively called
“opinion mining”) against the corpus. Sometimes I index the whole lot and
provide a search interface. Through named-entity extraction I pull out the
names of people, places, and things. The meanings of these things can then be
elaborated upon through Wikipedia look-ups. The dates can be plotted on a
timeline. I’m beginning to get into classification and clustering, but I
haven’t seen any really exciting things come out of topic modeling, yet.
Through all of these processes, I am able to supplement the original lists of
citations with value-added services. What I’m weak at is visualization.
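
By way of illustration, here is a sketch of the frequency and POS lists in
Python with the NLTK toolkit (an assumption; my own routines are Perl, but
NLTK may be more approachable):

  import nltk
  from collections import Counter

  # one-time model downloads; an assumption about your environment
  nltk.download('punkt')
  nltk.download('averaged_perceptron_tagger')

  # a hypothetical plain-text file previously extracted from a PDF
  text = open('corpus/document.txt', encoding='utf-8').read()
  tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

  # the most frequently used words
  print(Counter(tokens).most_common(25))

  # parts-of-speech analysis: the most frequently used nouns
  tagged = nltk.pos_tag(tokens)
  nouns = [word for (word, tag) in tagged if tag.startswith('NN')]
  print(Counter(nouns).most_common(25))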

Example projects have included:

  * harvesting “poverty tourism” websites, and learning
    how & why people are convinced to visit slums

  * collecting as many articles from the history of science
    literature as possible, and analyzing how the use of
    word “practice” has changed over time

  * similarly, collecting as many articles from the business
    section of the New York Times to determine how the words
    “tariff” and “trade” have changed over time

  * analyzing how people’s perceptions of culture have
    changed based on pre- and post-descriptions of China

  * collecting and analyzing the transcripts of trials during
    the 17th century to see how religion affected commerce

  * finding the common themes in a set of 4th century Catholic
    hymns

  * looking for alternative genres in a corpus of mediaeval
    literature

Trying to determine the significant words of a single document in isolation
is difficult. It is much easier to denote a set of significant words for a
single document when the document is part of a corpus. There seem to be
never-ending and ever-subtle differences in how to do this, but exploiting
TF/IDF is probably one of the more common approaches. [1] Consider also using
the cosine similarity measure to compare documents for “sameness”. [2] The
folks at Stanford have a very nice suite of natural language processors. [3]
I have created a tiny library of routines, albeit written in Perl, and
corresponding programs that do much of this work from the command line of my
desktop computer. [4]
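
To make the TF/IDF and cosine similarity ideas concrete, here is a sketch in
Python using the scikit-learn library (an assumption on my part; my own tools
are the Perl ones cited below), run against a directory of plain-text files:

  import glob
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  # read a (hypothetical) directory of previously extracted plain text
  files = sorted(glob.glob('corpus/*.txt'))
  texts = [open(f, encoding='utf-8').read() for f in files]

  # weight each word in each document with TF/IDF
  vectorizer = TfidfVectorizer(stop_words='english')
  matrix = vectorizer.fit_transform(texts)       # one row per document

  # list the ten most heavily weighted terms of the first document
  terms = vectorizer.get_feature_names_out()
  weights = matrix[0].toarray()[0]
  top = sorted(zip(terms, weights), key=lambda pair: pair[1], reverse=True)
  print(top[:10])

  # measure every document against every other document for “sameness”
  print(cosine_similarity(matrix))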

[1] TF/IDF - http://en.wikipedia.org/wiki/Tf–idf
[2] similarity - http://en.wikipedia.org/wiki/Cosine_similarity
[3] Stanford tools - http://www-nlp.stanford.edu
[4] tiny library - https://github.com/ericleasemorgan/Tiny-Text-Mining-Tools

—
Eric “Librarians Love Lists” Morgan
