Sounds useful to me.

On Nov 11, 2010, at 3:37 PM, Isabel Drost wrote:
> On 09.11.2010 Grant Ingersoll wrote:
>> Hi Benson, I looked at the site and it seemed interesting, but didn't dig
>> any deeper. Can you give a paragraph on what it does?
>
> If I understood Benson correctly when we were talking at Apache Con, the
> library is meant to provide a means to remove clutter (including
> navigational content etc.) from web pages. The original intention was to
> run it as a clean-up step before displaying the web page in browsers.
> However given that any text processing usually needs cleaning up web pages
> and extracting the relevant content as a very first step, the idea was to
> abuse the methods for automated text extraction as well.
>
> As a side note - a project with similar goals was mentioned on the Lucene
> mailing lists a while ago: http://code.google.com/p/boilerpipe/
>
> Cheers,
> Isabel
