Hi!
I'm fairly new to Nutch, and what I would like to do is crawl only pages in a specific language. I have successfully enabled the language-identifier plugin, and it identifies languages perfectly. But now I'm stuck on how to restrict the crawl to pages in that language.

My first idea was to create a post-processing tool (similar to dedup) that checks each indexed page and, if it has the wrong lang attribute, deletes it and removes all of its outlinks. You would run this tool after every indexing run.

The other idea was to create some kind of filter that discards the page (and its outlinks) as soon as the language has been identified (in LanguageIndexingFilter).

Which would be better, and what can I take as my starting point?

Thanks,
Samo Kralj
