Massimo Miccoli wrote:
Hi nutch dev,

After fetching about 100 million pages I see many search engine spammers
who use a hidden div tag (negative position) to include many URLs
that users don't see when they access the page. These links alter the boost
(by inlink count), so I want to skip these URLs.
How can I do that?

Implement an HtmlParseFilter, similar to the creativecommons plugin. Your plugin would then remove the matching tags.

In fact, if you have some spare cycles, you could implement a more generic "html cleanup" plugin, where you could specify a list of XPaths to match (and optionally replace).
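
A rough sketch of the core of such a cleanup step, using only the JDK's DOM and XPath classes. Everything named here (HtmlCleanup, removeMatching, HIDDEN_DIV, the sample markup) is made up for illustration and is not part of Nutch. The XPath is only a heuristic for "negative position" inline styles, and the element name case ("div" vs "DIV") depends on which HTML parser built the DOM. Wiring this into an HtmlParseFilter is left out because the filter signature differs between Nutch versions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HtmlCleanup {

    // Hypothetical heuristic: a div whose inline style has a negative left/top offset.
    // Spaces are stripped from the style first so "left: -9999px" also matches.
    static final String HIDDEN_DIV =
        "//div[contains(translate(@style, ' ', ''), 'left:-')"
        + " or contains(translate(@style, ' ', ''), 'top:-')]";

    /** Removes every node matched by any of the given XPaths; returns how many were detached. */
    static int removeMatching(Node root, List<String> xpaths) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        int removed = 0;
        for (String expr : xpaths) {
            NodeList hits = (NodeList) xpath.evaluate(expr, root, XPathConstants.NODESET);
            // Copy the matches before mutating the tree.
            List<Node> toRemove = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                toRemove.add(hits.item(i));
            }
            for (Node n : toRemove) {
                Node parent = n.getParentNode();
                if (parent != null) {
                    parent.removeChild(n);
                    removed++;
                }
            }
        }
        return removed;
    }

    /** Counts the anchor elements left under root. */
    static int countLinks(Node root) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Double n = (Double) xpath.evaluate("count(//a)", root, XPathConstants.NUMBER);
        return n.intValue();
    }

    /** Parses well-formed markup; stands in for the DOM a real HTML parser would hand over. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<html><body><a href='/ok'>ok</a>"
            + "<div style='position:absolute; left: -9999px'>"
            + "<a href='http://spam.example/1'>x</a>"
            + "<a href='http://spam.example/2'>y</a>"
            + "</div></body></html>");
        int removed = removeMatching(doc, List.of(HIDDEN_DIV));
        System.out.println("removed nodes: " + removed);
        System.out.println("links left: " + countLinks(doc));
    }
}
```

The list of XPaths would come from the plugin's configuration, so new spam patterns can be added without touching the code. Note that styles set through an external stylesheet or a class name are not caught by this expression.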

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
