Re: Language specific crawl

Alexander Aristov Tue, 12 Aug 2008 08:18:09 -0700

2008/8/12 Marcin Okraszewski <[EMAIL PROTECTED]>

> How would you prevent a page from being indexed in an IndexingFilter? As far
> as I remember, returning null in Nutch 0.9 caused a crash. I think throwing an
> exception worked, but it wrote a stack trace to the log. But I might be wrong.
>
> Marcin




Just look at the very bottom of the *IndexingFilters* class. Going by what you
say, how often should we then expect crashes? I don't know about Nutch 0.9; I
use the current version from trunk.
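
A minimal sketch of the pattern under discussion, assuming the filter chain
stops and skips the document as soon as one filter returns null. The types
below (DocFilter, a Map standing in for the document, the "lang" field name)
are hypothetical stand-ins so the example runs on its own; they are not
Nutch's actual classes or signatures.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LanguageFilterSketch {

    /** Hypothetical stand-in for an indexing filter; not the Nutch interface. */
    interface DocFilter {
        /** Returns the (possibly modified) document, or null to discard it. */
        Map<String, String> filter(Map<String, String> doc);
    }

    /** Builds a filter that discards any document whose "lang" field differs. */
    static DocFilter keepOnly(String wantedLang) {
        return doc -> wantedLang.equals(doc.get("lang")) ? doc : null;
    }

    /** Runs the filters in order and stops as soon as one discards the document. */
    static Map<String, String> runChain(List<DocFilter> filters, Map<String, String> doc) {
        for (DocFilter f : filters) {
            doc = f.filter(doc);
            if (doc == null) {
                return null; // document dropped; the caller skips indexing it
            }
        }
        return doc;
    }

    /** Helper for the demo: a document carrying only a language field. */
    static Map<String, String> docWithLang(String lang) {
        Map<String, String> doc = new HashMap<>();
        doc.put("lang", lang);
        return doc;
    }

    public static void main(String[] args) {
        List<DocFilter> chain = List.of(keepOnly("sl"));
        System.out.println(runChain(chain, docWithLang("sl")) != null); // prints true
        System.out.println(runChain(chain, docWithLang("en")) != null); // prints false
    }
}
```

Whether returning null is safe in a given Nutch release is exactly the open
question in this thread, so check how the caller treats a null result in the
version you run before relying on it.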



>
>
> On 12 August 2008 at 11:24, "Alexander Aristov" <
> [EMAIL PROTECTED]> wrote:
>
> > For fetching you will need all the content, but you can add an indexer
> > plugin to discard the unnecessary stuff.
> >
> > Implement IndexingFilter
> >
> > Alex
> >
> > On 12/08/2008, Samo Kralj  wrote:
> > >
> > > Hi!
> > >
> > >
> > >
> > > I'm fairly new to Nutch, and what I would like to do is crawl only pages
> > > of a specific language. I have successfully enabled the
> > > language-identifier plugin, and it identifies languages perfectly.
> > >
> > > But now I'm stuck on how to crawl only pages of a specific language.
> > >
> > > My first idea was to create a postprocessing tool (similar to dedup) that
> > > checks each indexed page and, if it has the wrong lang attribute, deletes
> > > it and removes all its outlinks. You'd run this tool after every indexing
> > > run.
> > >
> > >
> > >
> > > The other idea was to create some kind of filter that discards the page
> > > (and its outlinks) as soon as the language has been identified (in
> > > LanguageIndexingFilter).
> > >
> > >
> > >
> > > Which would be better and what can I take as my starting point?
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Samo Kralj
> > >
> > >
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >
>



-- 
Best Regards
Alexander Aristov
