Hi Katie,
I've been playing with natural language processing in both Python and R. There 
are lots of books and webpages out there with advice, but I find it's easy to 
get sucked into doing a manipulation that you *can* do instead of the one you 
*should* do to answer your research question (my case) or to serve a business 
purpose (which sounds like your case).

I just saw the post mentioning Perl - from what I've seen, it looks a lot 
easier in Python with NLTK and other packages.

Christina

------
Christina K. Pikas
Librarian
The Johns Hopkins University Applied Physics Laboratory
Baltimore: 443.778.4812
D.C.: 240.228.4812
[email protected]




-----Original Message-----
From: Code for Libraries [mailto:[email protected]] On Behalf Of Katie
Sent: Tuesday, July 01, 2014 9:13 AM
To: [email protected]
Subject: [CODE4LIB] Natural language programming

Hello,

Has anyone here experience in the world of natural language programming (while 
applying information retrieval techniques)? 

I'm currently trying to develop a tool that will:

1. take a pdf and extract the text (paying no attention to images or 
   formatting)
2. analyze the text via term weighting, inverse document frequency, and 
   other natural language processing techniques
3. assemble a list of suggested terms and concepts that are weighted 
   heavily in that document
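
A minimal sketch of steps 2 and 3, assuming the text has already been 
extracted from each pdf into a plain string. It uses only the Python standard 
library. The function names (tokenize, top_terms), the tiny stopword list, and 
the sample strings are hypothetical, and the score is plain tf * log(N/df) 
rather than any particular library's variant.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real project would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def top_terms(docs, k=5):
    """Return, for each document, its k highest-scoring (term, score) pairs."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:k])
    return results


if __name__ == "__main__":
    # Hypothetical stand-ins for text extracted from three pdfs.
    sample_docs = [
        "Water rights and water policy in the river basin",
        "Soil erosion policy for the river basin",
        "Oral histories of the basin communities",
    ]
    for terms in top_terms(sample_docs, k=3):
        print(terms)
```

A term that appears in every document scores zero under this weighting, so 
words common to the whole collection drop out of the suggestions without 
needing a long stopword list.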

Step 1 is straightforward and I've had much success there. Step 2 is the 
problem child. I've played around with a few APIs (like AlchemyAPI) but they 
have character length limitations or other shortcomings that keep me looking. 

The background behind this project is that I work for a digital library with a 
large pre-existing collection of pdfs with rudimentary metadata. The 
aforementioned tool will be used to classify and group the pdfs according to 
the themes of the library. Our CMS is Drupal so depending on my level of 
ambition, this *might* develop into a module.  

Does this sound like a project that has been done/attempted before? Any 
suggested tools or reading materials?
