Hi!

 

I'm fairly new to Nutch, and what I would like to do is crawl only pages in
a specific language. I have successfully enabled the language-identifier
plugin and it identifies languages perfectly.

But now I'm stuck on how to restrict the crawl to pages in that language.

My first idea was to create a post-processing tool (similar to dedup) that
checks each indexed page and, if its lang attribute is wrong, deletes the
page and removes all of its out links. You'd run this tool after every
indexing.

 

The other idea was to create some kind of filter that discards the page (and
its out links) as soon as the language has been identified (in
LanguageIndexingFilter).
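
To make the second idea concrete, here is a rough sketch of the logic I have
in mind. It does not use any real Nutch classes: Page, filter and
linksToFollow are hypothetical names I made up for illustration, and I am
only assuming that a filter can signal "discard" by returning null. How this
would hook into the actual plugin is exactly what I don't know.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LanguageFilterSketch {

    /** Hypothetical stand-in for a parsed page; this is NOT a Nutch class. */
    static final class Page {
        final String url;
        final String lang;            // language code set by the identifier, may be null
        final List<String> outlinks;  // links found on the page

        Page(String url, String lang, List<String> outlinks) {
            this.url = url;
            this.lang = lang;
            this.outlinks = outlinks;
        }
    }

    /** Returns the page if its language matches, otherwise null (assumed to mean "discard"). */
    static Page filter(Page page, String wantedLang) {
        if (page.lang == null || !page.lang.equalsIgnoreCase(wantedLang)) {
            return null;
        }
        return page;
    }

    /** Collects out links only from pages that survive the language filter. */
    static List<String> linksToFollow(List<Page> pages, String wantedLang) {
        List<String> result = new ArrayList<>();
        for (Page p : pages) {
            if (filter(p, wantedLang) != null) {
                result.addAll(p.outlinks);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
            new Page("http://example.com/a", "sl", Arrays.asList("http://example.com/b")),
            new Page("http://example.com/c", "en", Arrays.asList("http://example.com/d")));

        // Only out links from the Slovenian page should remain.
        System.out.println(linksToFollow(pages, "sl"));
    }
}
```

The point is that a rejected page contributes neither itself nor its out
links, so the crawl never expands from pages in the wrong language.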

 

Which approach would be better, and what can I take as my starting point?

 

Thanks,

Samo Kralj
