Re: Announcement: Boilerplate removal library

Andrzej Bialecki Fri, 04 Dec 2009 00:48:51 -0800

Christian Kohlschütter wrote:

Dear all,


I think the following announcement is of interest for the Lucene community.

Today I have released Boilerpipe 1.0.

Boilerpipe is a Java library for boilerplate removal and fulltext extraction 
from HTML pages.


Hi Christian,

That's excellent news; many users have been asking for functionality like this in Nutch and Solr.


In the past I had some success using the strategy described here:

http://article.gmane.org/gmane.comp.search.nutch.devel/25020

but results depended strongly on the type of site. Portals with many "portlet"-type boxes of useful but repeated content were a particular nuisance: it was nearly impossible to fix a threshold so that you skipped the repeated boxes but still caught most of the unique text of the body. My experience with that algorithm left me with the impression that no generic page-level (local) method can work well in such cases, and that the problem can only be solved with global-level (site or area of site) methods, which are more cumbersome to use in practice...
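
To make the local/global distinction concrete, here is a minimal sketch of a site-level filter. It is only an illustration, not the strategy from the post linked above and not Boilerpipe's algorithm. The names split_blocks, site_level_filter and max_page_fraction are hypothetical, and splitting on blank lines stands in for real DOM-based block segmentation.

```python
from collections import Counter


def split_blocks(page_text):
    """Split a page into text blocks on blank lines.

    Assumption: a stand-in for proper DOM-based block segmentation.
    """
    return [b.strip() for b in page_text.split("\n\n") if b.strip()]


def site_level_filter(pages, max_page_fraction=0.5):
    """Drop blocks that occur on more than max_page_fraction of the pages.

    This is a global method: it needs all pages of a site (or site area)
    before it can decide which blocks are repeated boilerplate.
    """
    doc_freq = Counter()
    blocks_per_page = []
    for text in pages:
        blocks = split_blocks(text)
        blocks_per_page.append(blocks)
        doc_freq.update(set(blocks))  # count each block once per page

    limit = max_page_fraction * len(pages)
    return [
        [b for b in blocks if doc_freq[b] <= limit]
        for blocks in blocks_per_page
    ]


# Toy "site": a menu on every page, a portlet box on two of three pages.
pages = [
    "Site menu\n\nArticle about Lucene scoring.\n\nTop stories box",
    "Site menu\n\nArticle about Nutch crawling.\n\nTop stories box",
    "Site menu\n\nArticle about Solr faceting.\n\nWeather box",
]
cleaned = site_level_filter(pages)
```

In this toy data the menu and the box repeated on two pages are dropped, while the box that appears on only one page survives alongside the article text. That is the threshold problem in miniature: no single cutoff cleanly separates rarely repeated portlets from unique body text.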


I'm looking forward to experimenting with your implementation!

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


