On 09.11.2010 Grant Ingersoll wrote:
> Hi Benson, I looked at the site and it seemed interesting, but didn't dig
> any deeper. Can you give a paragraph on what it does?

If I understood Benson correctly when we talked at ApacheCon, the library is meant to remove clutter (navigational content and the like) from web pages. The original intention was to run it as a clean-up step before displaying a web page in a browser. However, since almost any text processing of web pages has to start by cleaning them up and extracting the relevant content, the idea was to abuse the same methods for automated text extraction as well.

As a side note, a project with similar goals was mentioned on the Lucene mailing lists a while ago: http://code.google.com/p/boilerpipe/

Cheers,
Isabel
