Hello,

Does anyone here have experience with natural language processing (specifically, 
applying information retrieval techniques)?

I'm currently trying to develop a tool that will:

1. take a PDF and extract its text (ignoring images and formatting)
2. analyze the text using term weighting, inverse document frequency, and other 
natural language processing techniques
3. assemble a list of suggested terms and concepts that are weighted heavily in 
that document

Step 1 is straightforward and I've had much success there. Step 2 is the 
problem child. I've experimented with a few APIs (AlchemyAPI, for example), but 
they have character-length limits or other shortcomings that keep me looking. 
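For what it's worth, step 2 doesn't necessarily require an external API at all. Here is a minimal sketch of plain TF-IDF run locally (so no character limits), using only the Python standard library. The toy corpus, the whitespace tokenizer, and the function names are my own illustrative assumptions; in practice you would feed it the text extracted from each PDF in step 1 and use a real tokenizer/stemmer.

```python
# Minimal local TF-IDF sketch (assumption: whitespace tokenization and a toy
# corpus stand in for real extracted PDF text).
import math
from collections import Counter

def tokenize(text):
    # Crude tokenizer: lowercase, keep purely alphabetic words.
    return [w.lower() for w in text.split() if w.isalpha()]

def tf_idf(docs):
    """For each document, return its terms ranked by TF-IDF weight."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    results = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks)
        # Weight = (term frequency in doc) * log(N / document frequency).
        weights = {t: (count / total) * math.log(n / df[t])
                   for t, count in tf.items()}
        results.append(sorted(weights.items(), key=lambda kv: -kv[1]))
    return results

docs = [
    "the library catalog stores metadata for each item",
    "metadata describes themes of the digital library collection",
    "drupal modules can extend the content management system",
]
for ranked in tf_idf(docs):
    print([term for term, weight in ranked[:3]])
```

Terms that appear in every document (like "the" above) get a weight of zero, which is exactly the behavior you want for step 3's suggested-term list. For production use, scikit-learn's `TfidfVectorizer` implements the same idea with proper tokenization and sparse matrices.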

The background behind this project is that I work for a digital library with a 
large pre-existing collection of PDFs that have only rudimentary metadata. The 
aforementioned tool will be used to classify and group the PDFs according to 
the library's themes. Our CMS is Drupal, so depending on my level of 
ambition, this *might* grow into a module.  

Does this sound like a project that has been done/attempted before? Any 
suggested tools or reading materials?
