Re: Language specific crawl

Alexander Aristov Tue, 12 Aug 2008 02:24:53 -0700

For fetching you will need all content, but you can add an indexer plugin to
discard unnecessary stuff.


Implement IndexingFilter.
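
A rough sketch of the idea, not real Nutch code: the extension point in Nutch
is, as far as I know, org.apache.nutch.indexer.IndexingFilter, and its exact
method signature depends on the Nutch version. To keep the example
self-contained, Doc and DocFilter below are hypothetical stand-ins for the
Nutch document and filter types. It also assumes that a filter returning null
means "drop this page from the index" (check that your Nutch version treats
null this way) and that pages with no detected language are dropped too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LanguageFilterSketch {

    /** Hypothetical stand-in for a Nutch document: a URL plus the detected "lang" value. */
    static final class Doc {
        final String url;
        final String lang; // null when no language was detected

        Doc(String url, String lang) {
            this.url = url;
            this.lang = lang;
        }
    }

    /** Hypothetical stand-in for the indexing filter extension point. */
    interface DocFilter {
        /** Returns the document to keep it, or null to discard it (assumed convention). */
        Doc filter(Doc doc);
    }

    /** Keeps only documents whose detected language matches the wanted one. */
    static final class LanguageOnlyFilter implements DocFilter {
        private final String wanted;

        LanguageOnlyFilter(String wanted) {
            this.wanted = wanted.toLowerCase(Locale.ROOT);
        }

        @Override
        public Doc filter(Doc doc) {
            if (doc.lang == null) {
                return null; // assumption: unknown language is treated as unwanted
            }
            return wanted.equals(doc.lang.toLowerCase(Locale.ROOT)) ? doc : null;
        }
    }

    /** Simulates the indexing step: every fetched page passes through the filter. */
    static List<Doc> index(List<Doc> fetched, DocFilter filter) {
        List<Doc> kept = new ArrayList<>();
        for (Doc doc : fetched) {
            Doc result = filter.filter(doc);
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Doc> fetched = new ArrayList<>();
        fetched.add(new Doc("http://example.org/a", "sl"));
        fetched.add(new Doc("http://example.org/b", "en"));
        fetched.add(new Doc("http://example.org/c", null));

        // All three pages were fetched; only the matching ones reach the index.
        for (Doc doc : index(fetched, new LanguageOnlyFilter("sl"))) {
            System.out.println(doc.url);
        }
    }
}
```

Note that this only affects what gets indexed. The pages are still fetched and
their outlinks still end up in the crawl db.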

Alex

On 12/08/2008, Samo Kralj <[EMAIL PROTECTED]> wrote:
>
> Hi!
>
>
>
> I'm fairly new to Nutch, and what I would like to do is crawl only pages in
> a specific language. I have successfully enabled the language-identifier
> plugin, and it identifies languages perfectly.
>
> But now I'm stuck on how to crawl only pages of specific language.
>
> My first idea was to create a postprocess tool (similar to dedup) that
> checks each indexed page and if it has wrong lang attribute deletes it and
> removes all out links. You'd run this tool after every indexing.
>
>
>
> The other idea was to create some kind of filter that discards the page (and
> its outlinks) as soon as the language has been identified (in
> LanguageIndexingFilter).
>
>
>
> Which would be better and what can I take as my starting point?
>
>
>
> Thanks,
>
> Samo Kralj
>
>


-- 
Best Regards
Alexander Aristov
