Christian Kohlschütter wrote:
Dear all,

I think the following announcement is of interest for the Lucene community.

Today I have released Boilerpipe 1.0.

Boilerpipe is a Java library for boilerplate removal and fulltext extraction 
from HTML pages.

Hi Christian,

That's excellent news. Many users have been asking for functionality like this in Nutch and Solr.

In the past I had some success using the strategy described here:

http://article.gmane.org/gmane.comp.search.nutch.devel/25020

but results depended strongly on the type of site. Portals with many "portlet"-type boxes of useful, but repeated, content were a particular nuisance - it was nearly impossible to fix a threshold so that you skipped the repeated boxes, but still caught most of the unique text of the body. My experience with that algorithm left me with the impression that no generic page-level (local) method can work well in such cases, and that the problem can only be solved with global-level (site or area of site) methods, which are more cumbersome to use in practice...
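
To make the "global-level" idea concrete, here is a minimal sketch of one such method. It is neither the strategy from the post linked above nor what Boilerpipe does; it only assumes that each page has already been split into text blocks, and it drops every block that shows up on more than a given fraction of the pages of a site. The class name SiteLevelBlockFilter, the method filter and the parameter maxPageFraction are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: site-level removal of repeated text blocks. */
public class SiteLevelBlockFilter {

    /**
     * Returns the pages with every block removed that occurs on more than
     * maxPageFraction of all pages. Blocks are compared by exact string
     * equality, which is an assumption; a real system would likely normalize
     * or hash the text first.
     */
    public static List<List<String>> filter(List<List<String>> pages,
                                            double maxPageFraction) {
        // First pass over the whole site: on how many pages does each block occur?
        Map<String, Integer> pageCount = new HashMap<>();
        for (List<String> page : pages) {
            // The set ensures a block is counted once per page.
            for (String block : new HashSet<>(page)) {
                pageCount.merge(block, 1, Integer::sum);
            }
        }

        // Second pass: keep only blocks that are rare enough across the site.
        double limit = maxPageFraction * pages.size();
        List<List<String>> result = new ArrayList<>();
        for (List<String> page : pages) {
            List<String> kept = new ArrayList<>();
            for (String block : page) {
                if (pageCount.get(block) <= limit) {
                    kept.add(block);
                }
            }
            result.add(kept);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
            Arrays.asList("Site navigation", "Weather box", "Article one"),
            Arrays.asList("Site navigation", "Weather box", "Article two"),
            Arrays.asList("Site navigation", "Article three"));
        // With a fraction of 0.5, blocks seen on 2 or 3 of the 3 pages are dropped.
        System.out.println(filter(pages, 0.5));
    }
}
```

The need for two passes over all pages of a site (or area of a site) before a single page can be cleaned is the cumbersome part, and the value of maxPageFraction is still a threshold that has to be tuned per site.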

I'm looking forward to experimenting with your implementation!

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
