Christian Kohlschütter wrote:
Dear all,
I think the following announcement is of interest for the Lucene community.
Today I have released Boilerpipe 1.0.
Boilerpipe is a Java library for boilerplate removal and fulltext extraction
from HTML pages.
Hi Christian,
That's excellent news; many users have been asking for functionality
like this in Nutch and Solr.
In the past I had some success using the strategy described here:
http://article.gmane.org/gmane.comp.search.nutch.devel/25020
but results depended strongly on the type of site. Portals with many
"portlet"-type boxes of useful but repeated content were a particular
nuisance: it was nearly impossible to set a threshold that skipped the
repeated boxes yet still caught most of the unique text of the body. My
experience with that algorithm left the impression that no generic
page-level (local) method can work well in such cases, and that the
problem can only be solved with global-level (site or area of site)
methods, which are more cumbersome to use in practice...
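
To make the local/global distinction concrete, here is a minimal sketch of
what a global-level filter could look like. It is neither the strategy from
the post linked above nor Boilerpipe's algorithm. It assumes the text of each
page has already been extracted and that blocks are separated by blank lines;
the names split_blocks, unique_blocks and max_share are hypothetical.

```python
from collections import Counter


def split_blocks(page_text):
    """Split extracted page text into blocks on blank lines (an assumption)."""
    return [b.strip() for b in page_text.split("\n\n") if b.strip()]


def unique_blocks(pages, max_share=0.5):
    """For each page, keep only blocks seen on at most max_share of all pages."""
    page_blocks = [split_blocks(p) for p in pages]

    # Site-wide document frequency: count each block once per page.
    df = Counter()
    for blocks in page_blocks:
        df.update(set(blocks))

    limit = max_share * len(pages)
    return [[b for b in blocks if df[b] <= limit] for blocks in page_blocks]


if __name__ == "__main__":
    site = [
        "Latest headlines\n\nArticle one body",
        "Latest headlines\n\nArticle two body",
        "Latest headlines\n\nArticle three body",
    ]
    for kept in unique_blocks(site):
        print(kept)
```

The catch is the one mentioned above: the filter needs every page of the site
(or area of the site) before it can decide anything about a single page, and
max_share is still a threshold that has to be tuned per site.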
I'm looking forward to experimenting with your implementation!
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com