On 04.12.2009 at 09:48, Andrzej Bialecki wrote:
> Christian Kohlschütter wrote:
>> Dear all,
>> I think the following announcement is of interest for the Lucene community.
>> Today I have released Boilerpipe 1.0.
>> Boilerpipe is a Java library for boilerplate removal and fulltext extraction
>> from HTML pages.
>
> Hi Christian,
>
> That's excellent news, many users have been asking for functionality like
> this in Nutch and Solr.
>
> In the past I had some success using the strategy described here:
>
> http://article.gmane.org/gmane.comp.search.nutch.devel/25020
>
> but results depended strongly on the type of site. Portals with many
> "portlet"-type boxes of useful, but repeated, content were a particular
> nuisance: it was nearly impossible to fix a threshold that skipped the
> repeated boxes but still caught most of the unique text of the body. My
> experience with that algorithm left the impression that no generic
> page-level (local) method can work well in such cases, and that the problem
> can only be solved by global-level (site or area of site) methods, which are
> more cumbersome to use in practice...
>
> I'm looking forward to experimenting with your implementation!
Hi Andrzej,

thanks for this feedback!

The algorithm you describe sounds a bit like BTE (which I evaluated in my paper). At least for news sites, my strategies outperformed BTE.

In general, any extractor may fail on particular pages, so it is relatively easy to craft a bad example. But if it works 95% of the time, we can be more than happy (see Figure 2b in my paper, which is linked from the boilerpipe homepage).

If you could send me (by private email) some example pages that failed with the other algorithm, I can check how boilerpipe handles them and tune the code if necessary.

Cheers,
Christian

-- 
Christian Kohlschütter
kohlschuet...@l3s.de
Forschungszentrum L3S
Leibniz Universität Hannover
http://www.L3S.de/~kohlschuetter/
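
To make the threshold problem from the quoted message concrete, here is a minimal sketch of a page-local filter. Everything in it is an assumption for illustration: the class `ThresholdFilter`, the two features (word count and share of linked words), and the cut-off values are hypothetical. It is neither Boilerpipe's API nor the algorithm from the linked Nutch post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical page-local filter; NOT Boilerpipe and NOT the linked algorithm. */
final class ThresholdFilter {

    /** A text block described by two shallow features (illustrative names). */
    static final class Block {
        final String label;
        final int words;        // total words in the block
        final int linkedWords;  // words that sit inside link anchors

        Block(String label, int words, int linkedWords) {
            this.label = label;
            this.words = words;
            this.linkedWords = linkedWords;
        }

        /** Fraction of the block's words that are link text. */
        double linkDensity() {
            return words == 0 ? 0.0 : (double) linkedWords / words;
        }
    }

    private final int minWords;
    private final double maxLinkDensity;

    ThresholdFilter(int minWords, double maxLinkDensity) {
        this.minWords = minWords;
        this.maxLinkDensity = maxLinkDensity;
    }

    /** A block counts as content if it is long enough and not mostly links. */
    boolean isContent(Block b) {
        return b.words >= minWords && b.linkDensity() <= maxLinkDensity;
    }

    /** Returns the labels of all blocks that pass the thresholds, in order. */
    List<String> keep(List<Block> blocks) {
        List<String> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (isContent(b)) {
                kept.add(b.label);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up feature values for a portal-style page.
        List<Block> page = Arrays.asList(
            new Block("navigation", 12, 12),
            new Block("article body", 400, 8),
            new Block("portlet box", 60, 3),
            new Block("footer", 9, 4));

        ThresholdFilter filter = new ThresholdFilter(20, 0.3);
        System.out.println(filter.keep(page)); // prints [article body, portlet box]
    }
}
```

With these made-up numbers the portlet box passes, because locally it looks like ordinary text. Raising `minWords` far enough to drop it would also drop any genuine paragraph of similar length, which is the trade-off described in the quoted message.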