On Jul 1, 2014, at 9:12 AM, Katie <[email protected]> wrote:
> Does anyone here have experience in the world of natural language processing
> (while applying information retrieval techniques)?
>
> I'm currently trying to develop a tool that will:
>
> 1. take a PDF and extract the text (paying no attention to images or
> formatting)
> 2. analyze the text via term weighting, inverse document frequency, and
> other natural language processing techniques
> 3. assemble a list of suggested terms and concepts that are weighted
> heavily in that document
>
> Step 1 is straightforward and I've had much success there. Step 2 is the
> problem child. I've played around with a few APIs (like AlchemyAPI) but they
> have character length limitations or other shortcomings that keep me looking.
>
> The background behind this project is that I work for a digital library with
> a large pre-existing collection of PDFs with rudimentary metadata. The
> aforementioned tool will be used to classify and group the PDFs according to
> the themes of the library. Our CMS is Drupal so, depending on my level of
> ambition, this *might* develop into a module.
>
> Does this sound like a project that has been done/attempted before? Any
> suggested tools or reading materials?
You have, more or less, just described my job. Increasingly, I:
* create or am given a list of citations
* save the citations as a computer-readable list (database)
* harvest the full text of each cited item
* extract the plain text from the harvested PDF file (a sketch follows this list)
* clean up / post-process the text, maybe
* do analysis against individual texts or the entire corpus
* provide interfaces to “read” the corpus from “a distance”
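For the extraction and clean-up steps, here is a minimal sketch in Python, assuming the pdfminer.six library and a hypothetical file name; plenty of other extractors will do:

  # extract plain text from a PDF, ignoring images and formatting,
  # then lightly normalize the whitespace
  # (assumes pdfminer.six: pip install pdfminer.six)
  import re
  from pdfminer.high_level import extract_text

  text = extract_text('document.pdf')       # hypothetical file name
  text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
  print(text[:500])                         # peek at the result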
The analysis is akin to descriptive statistics but for “bags of words”. I
create lists of both frequently used as well as statistically significant
words/phrases. I do parts-of-speech (POS) analysis and create lists of nouns,
verbs, adjectives, etc. I then create more lists of the frequently used and
significant POS. I sometimes do sentiment analysis (alternatively called
“opinion mining”) against the corpus. Sometimes I index the whole lot and
provide a search interface. Through named-entity extraction I pull out names of
people, places, and things. The meaning of these things can then be elaborated
upon through Wikipedia look-ups. The dates can be plotted on a timeline. I’m
beginning to get into classification and clustering, but I haven’t seen any
really exciting things come out of topic modeling, yet. Through all of these
processes, I am able to supplement the original lists of citations with
value-added services. What I’m weak at is visualization.
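To illustrate the frequency and POS lists, a small sketch using NLTK, my choice for this example rather than any tool named above; it assumes the 'punkt' and 'averaged_perceptron_tagger' models have already been fetched with nltk.download:

  # list the most frequent words and the most frequent nouns in a text
  import nltk

  def frequent_words(text, n=25):
      tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
      return nltk.FreqDist(tokens).most_common(n)

  def frequent_nouns(text, n=25):
      tagged = nltk.pos_tag(nltk.word_tokenize(text))
      nouns = [w.lower() for w, tag in tagged if tag.startswith('NN')]
      return nltk.FreqDist(nouns).most_common(n)

The same pattern extends to verbs and adjectives by testing for 'VB' and 'JJ' tags.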
Example projects have included:
* harvesting “poverty tourism” websites, and learning
how & why people are convinced to visit slums
* collecting as many articles from the history of science
literature as possible, and analyzing how the use of the
word “practice” has changed over time
* similarly, collecting as many articles as possible from the
business section of the New York Times to determine how the
use of the words “tariff” and “trade” has changed over time
* analyzing how people’s perceptions of culture have
changed based on pre- and post-descriptions of China
* collecting and analyzing the transcripts of trials during
the 17th century to see whether and how religion affected commerce
* finding the common themes in a set of 4th century Catholic
hymns
* looking for alternative genres in a corpus of mediaeval
literature
Trying to determine the significant words of a single document in isolation is
difficult. It is much easier to identify a set of significant words for a
single document when that document is part of a corpus. There seem to be
never-ending and ever-subtle differences in how to do this, but exploiting
TF/IDF is probably one of the more common approaches. [1] Consider also using
the cosine similarity measure to compare documents for “sameness”. [2] The
folks at Stanford have a very nice suite of natural language processors. [3] I
have created a tiny library of routines and corresponding programs, albeit
written in Perl, that do much of this work from the command line of my desktop
computer. [4]
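By way of a sketch, here are both ideas in a few lines of Python using scikit-learn, an assumption for this example (the tiny library below does similar work in Perl, and the file names are hypothetical):

  # weight a small corpus with TF/IDF, list the heaviest terms of one
  # document, and compare all documents with cosine similarity
  # (assumes scikit-learn: pip install scikit-learn)
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  files = ['a.txt', 'b.txt', 'c.txt']                 # hypothetical corpus
  corpus = [open(f, encoding='utf-8').read() for f in files]

  vectorizer = TfidfVectorizer(stop_words='english')
  tfidf = vectorizer.fit_transform(corpus)            # documents x terms

  # the ten most heavily weighted terms of the first document
  terms = vectorizer.get_feature_names_out()
  row = tfidf[0].toarray().ravel()
  for i in row.argsort()[::-1][:10]:
      print(terms[i], round(row[i], 3))

  # pairwise "sameness" of every document against every other
  print(cosine_similarity(tfidf))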
[1] TF/IDF - http://en.wikipedia.org/wiki/Tf–idf
[2] similarity - http://en.wikipedia.org/wiki/Cosine_similarity
[3] Stanford tools - http://www-nlp.stanford.edu
[4] tiny library - https://github.com/ericleasemorgan/Tiny-Text-Mining-Tools
—
Eric “Librarians Love Lists” Morgan