On 04.12.2009 at 09:48, Andrzej Bialecki wrote:

> Christian Kohlschütter wrote:
>> Dear all,
>> I think the following announcement is of interest for the Lucene community.
>> Today I have released Boilerpipe 1.0.
>> Boilerpipe is a Java library for boilerplate removal and fulltext extraction 
>> from HTML pages.
> 
> Hi Christian,
> 
> That's excellent news; many users have been asking for functionality like 
> this in Nutch and Solr.
> 
> In the past I had some success using the strategy described here:
> 
> http://article.gmane.org/gmane.comp.search.nutch.devel/25020
> 
> but results depended strongly on the type of site. Especially portals with 
> many "portlet"-type boxes of useful, but repeatable, content were a nuisance 
> - it was nearly impossible to fix a threshold so that you skipped repeated 
> boxes, but still caught most of the unique text of the body. My experience 
> with that algorithm left the impression that no generic page-level 
> (local) method can work well in such cases, and that the problem can only be 
> solved by using global-level (site or area of site) methods, which are more 
> cumbersome to use in practice...
> 
> I'm looking forward to experimenting with your implementation!


Hi Andrzej,

Thanks for this feedback!

The algorithm you describe sounds a bit like BTE (which I have evaluated in my 
paper). At least for news sites my strategies outperformed BTE.

Generally, any extractor may fail on particular pages, so it may be 
relatively easy to craft a bad example. But if an extractor works 95% of the 
time, we can be more than happy (see Figure 2b in my paper, which is linked 
from the boilerpipe homepage).

If you could send me (by private email) some example pages that failed with the 
other algorithm, I can check how boilerpipe handles them and tune the code if 
necessary.

Cheers,
Christian
-- 
Christian Kohlschütter
kohlschuet...@l3s.de

Forschungszentrum L3S
Leibniz Universität Hannover

http://www.L3S.de/~kohlschuetter/

