Sounds useful to me.

On Nov 11, 2010, at 3:37 PM, Isabel Drost wrote:

> On 09.11.2010 Grant Ingersoll wrote:
>> Hi Benson, I looked at the site and it seemed interesting, but didn't dig
>> any deeper.  Can you give a paragraph on what it does?
> 
> If I understood Benson correctly when we were talking at ApacheCon, the
> library is meant to provide a means to remove clutter (including
> navigational content etc.) from web pages. The original intention was to
> run it as a clean-up step before displaying the web page in browsers.
> However, given that any text processing usually needs to clean up web pages
> and extract the relevant content as a very first step, the idea was to
> abuse the methods for automated text extraction as well.
> 
> As a side note - a project with similar goals was mentioned on the Lucene 
> mailing lists a while ago: http://code.google.com/p/boilerpipe/
> 
> Cheers,
> Isabel
