On 09.11.2010 Grant Ingersoll wrote:
> Hi Benson, I looked at the site and it seemed interesting, but didn't dig
> any deeper.  Can you give a paragraph on what it does?

If I understood Benson correctly when we were talking at ApacheCon, the
library is meant to provide a means to remove clutter (including navigational
content etc.) from web pages. The original intention was to run it as a
clean-up step before displaying the web page in browsers. However, given that
text processing usually requires cleaning up web pages and extracting the
relevant content as a very first step, the idea was to abuse the methods for
automated text extraction as well.

As a side note - a project with similar goals was mentioned on the Lucene 
mailing lists a while ago: http://code.google.com/p/boilerpipe/

Cheers,
Isabel
