For fetching you will need all the content, but you can add an indexer plugin to discard the unnecessary stuff.
Implement an IndexingFilter.

Alex

On 12/08/2008, Samo Kralj <[EMAIL PROTECTED]> wrote:
>
> Hi!
>
> I'm fairly new with Nutch and what I would like to do is crawl only pages
> of a specific language. I have successfully enabled the language-identifier
> plugin and it identifies languages perfectly.
>
> But now I'm stuck on how to crawl only pages of a specific language.
>
> My first idea was to create a postprocess tool (similar to dedup) that
> checks each indexed page and, if it has the wrong lang attribute, deletes it
> and removes all out links. You'd run this tool after every indexing.
>
> The other idea was to create some kind of filter that discards the page (and
> out links) as soon as the language has been identified (in
> LanguageIndexingFilter)?
>
> Which would be better and what can I take as my starting point?
>
> Thanks,
> Samo Kralj

--
Best Regards
Alexander Aristov
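
A minimal sketch of the indexing-time filter idea suggested in this thread, in case it helps as a starting point. It deliberately does not use the Nutch API: `Doc`, `DocFilter`, `LanguageFilter` and `index` are hypothetical stand-ins invented for illustration, and the convention that returning null drops a document is an assumption of this sketch. The real Nutch interface has a different signature, and how it treats a discarded document should be checked against the Nutch version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class LanguageFilterSketch {

    /** Hypothetical stand-in for a page that is about to be indexed (not a Nutch class). */
    static final class Doc {
        final String url;
        final String lang; // language code set by a language identifier, may be null

        Doc(String url, String lang) {
            this.url = url;
            this.lang = lang;
        }
    }

    /** Hypothetical filter contract: return the doc to keep it, or null to discard it. */
    interface DocFilter {
        Doc filter(Doc doc);
    }

    /** Keeps only documents whose identified language is in the allowed set. */
    static final class LanguageFilter implements DocFilter {
        private final Set<String> allowed;

        LanguageFilter(Set<String> allowed) {
            this.allowed = allowed;
        }

        @Override
        public Doc filter(Doc doc) {
            // Pages with no identified language are discarded as well.
            return (doc.lang != null && allowed.contains(doc.lang)) ? doc : null;
        }
    }

    /** Runs every fetched document through the filter and collects the survivors. */
    static List<Doc> index(List<Doc> fetched, DocFilter filter) {
        List<Doc> kept = new ArrayList<>();
        for (Doc doc : fetched) {
            Doc result = filter.filter(doc);
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Doc> fetched = Arrays.asList(
                new Doc("http://example.org/sl-page", "sl"),
                new Doc("http://example.org/en-page", "en"),
                new Doc("http://example.org/unknown", null));

        DocFilter onlySlovenian = new LanguageFilter(Collections.singleton("sl"));
        for (Doc doc : index(fetched, onlySlovenian)) {
            System.out.println(doc.url);
        }
    }
}
```

Note that a filter like this only keeps unwanted pages out of the index. As said above, the pages themselves still have to be fetched before their language can be identified.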
