2008/8/12 Marcin Okraszewski <[EMAIL PROTECTED]>

> How would you prevent a page from being indexed in an IndexFilter? As far as I
> remember, returning null in Nutch 0.9 caused a crash. I think throwing an
> exception worked, but it wrote a stack trace to the log. But I might be wrong.
>
> Marcin

Just look at the very bottom of the *IndexingFilters* class. Going by what you describe, how often should we then expect crashes? I don't know about Nutch 0.9; I use the current version from trunk.

> On 12 August 2008 at 11:24, "Alexander Aristov" <[EMAIL PROTECTED]> wrote:
>
> > For fetching you will need all the content, but you can add an indexer plugin to
> > discard unnecessary stuff.
> >
> > Implement IndexerFilter
> >
> > Alex
> >
> > On 12/08/2008, Samo Kralj wrote:
> > >
> > > Hi!
> > >
> > > I'm fairly new to Nutch, and what I would like to do is crawl only pages
> > > of a specific language. I have successfully enabled the language-identifier
> > > plugin and it identifies languages perfectly.
> > >
> > > But now I'm stuck on how to crawl only pages of a specific language.
> > >
> > > My first idea was to create a postprocessing tool (similar to dedup) that
> > > checks each indexed page and, if it has the wrong lang attribute, deletes it
> > > and removes all its outlinks. You'd run this tool after every indexing run.
> > >
> > > The other idea was to create some kind of filter that discards the page
> > > (and its outlinks) as soon as the language has been identified (in
> > > LanguageIndexingFilter).
> > >
> > > Which would be better, and what can I take as my starting point?
> > >
> > > Thanks,
> > >
> > > Samo Kralj
> >
> > --
> > Best Regards
> > Alexander Aristov

--
Best Regards
Alexander Aristov
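
To make the "return null to discard" idea discussed above concrete, here is a minimal, self-contained sketch. It deliberately does not use the Nutch API: `DocFilter`, `onlyLanguage`, `runChain` and the plain field map standing in for a document are all hypothetical names invented for illustration. It assumes the filter chain treats a null result as "drop this document", which is the behaviour the reply seems to point at in trunk; check the actual IndexingFilters source for your version before relying on it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins, NOT the Nutch API: a "document" is a plain field map.
class LanguageFilterSketch {

    /** Hypothetical contract: return the document to keep it, or null to discard it. */
    interface DocFilter {
        Map<String, String> filter(Map<String, String> doc);
    }

    /** Keeps only documents whose "lang" field equals the wanted language code. */
    static DocFilter onlyLanguage(String wanted) {
        // A missing "lang" field yields null from get(), so the document is discarded.
        return doc -> wanted.equals(doc.get("lang")) ? doc : null;
    }

    /** Runs the filters in order and stops as soon as one discards the document. */
    static Map<String, String> runChain(List<DocFilter> filters, Map<String, String> doc) {
        for (DocFilter f : filters) {
            doc = f.filter(doc);
            if (doc == null) {
                return null; // discarded: later filters never see it
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        List<DocFilter> chain = List.of(onlyLanguage("sl"));

        Map<String, String> slovenian = new HashMap<>();
        slovenian.put("url", "http://example.org/a");
        slovenian.put("lang", "sl");

        Map<String, String> english = new HashMap<>();
        english.put("url", "http://example.org/b");
        english.put("lang", "en");

        System.out.println(runChain(chain, slovenian) != null); // prints true
        System.out.println(runChain(chain, english) != null);   // prints false
    }
}
```

Note that a filter of this kind only keeps pages out of the index. It does not stop the crawler from fetching them or following their outlinks, which is the other half of Samo's question.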
