AFAIK it is not possible to crawl only a specific language. You have to
fetch the content first and apply the language-identifier plugin, as you
have done. To keep outlinks from pages outside your target language from
being written to the crawldb and fetched later, you would need either to
change CrawlDbReducer to check the language during the update, or to
write a custom MR job that filters the segments after fetching and
parsing but before the crawldb is updated.
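As a rough illustration of the filtering step, here is a minimal sketch
in plain Java. It deliberately uses no Nutch or Hadoop classes:
ParsedPage and LanguageOutlinkFilter are hypothetical names made up for
this example, and a real job would read the language and outlinks from
the segment's parse data instead of from an in-memory list.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for one parsed page in a segment; not a Nutch class.
final class ParsedPage {
    final String url;
    final String lang;            // language code from the identifier, may be null
    final List<String> outlinks;  // URLs this page links to

    ParsedPage(String url, String lang, List<String> outlinks) {
        this.url = url;
        this.lang = lang;
        this.outlinks = outlinks;
    }
}

// Hypothetical helper: decides which outlinks may reach the crawldb update.
final class LanguageOutlinkFilter {
    private final String targetLang;

    LanguageOutlinkFilter(String targetLang) {
        this.targetLang = targetLang;
    }

    // True when the page was identified as being in the target language.
    boolean accepts(ParsedPage page) {
        return page.lang != null && page.lang.equalsIgnoreCase(targetLang);
    }

    // Collects outlinks only from accepted pages, preserving first-seen order.
    Set<String> outlinksToUpdate(List<ParsedPage> segment) {
        Set<String> kept = new LinkedHashSet<>();
        for (ParsedPage page : segment) {
            if (accepts(page)) {
                kept.addAll(page.outlinks);
            }
        }
        return kept;
    }
}
```

Pages with no detected language are rejected here; whether that is the
right default depends on how reliable the identifier is for your content.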
Although it is a little hacky, you could also change the fetcher, around
line 358 or so, so that it does not collect the segment output unless the
page is in your target language.
Dennis
Samo Kralj wrote:
Thank you for your fast reply! Just one more question before I dive into
code...
Is it possible to discard or delete the outgoing links of the current
page from an indexer filter?
Pages in one language often link to pages in the same language, so it
would be wise to remove the links if they come from a discarded page.
-----Original Message-----
From: Alexander Aristov [mailto:[EMAIL PROTECTED]
Sent: 12 August 2008 11:24
To: [email protected]
Subject: Re: Language specific crawl
For fetching you will need all the content, but you can add an indexer
plugin to discard unnecessary stuff.
Implement IndexingFilter
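A minimal sketch of the discard logic, in plain Java and under stated
assumptions: IndexDoc and DocFilter are hypothetical stand-ins, not the
Nutch plugin interface, and both the "lang" field name and the
convention that returning null means "skip this document" are
assumptions to check against your Nutch version.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an index document: a flat field map,
// not a Lucene or Nutch class.
final class IndexDoc {
    final Map<String, String> fields = new HashMap<>();
}

// Hypothetical filter contract; returning null is assumed to mean "skip".
interface DocFilter {
    IndexDoc filter(IndexDoc doc);
}

// Keeps a document only when its "lang" field matches the target language.
final class LanguageDocFilter implements DocFilter {
    private final String targetLang;

    LanguageDocFilter(String targetLang) {
        this.targetLang = targetLang;
    }

    @Override
    public IndexDoc filter(IndexDoc doc) {
        // A missing "lang" field yields null here, so the document is skipped.
        String lang = doc.fields.get("lang");
        return targetLang.equalsIgnoreCase(lang) ? doc : null;
    }
}
```

This only keeps such pages out of the index; their outlinks would still
have been recorded when the segment was processed.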
Alex
On 12/08/2008, Samo Kralj <[EMAIL PROTECTED]> wrote:
Hi!
I'm fairly new to Nutch and what I would like to do is crawl only pages
in a specific language. I have successfully enabled the
language-identifier plugin and it identifies languages perfectly.
But now I'm stuck on how to crawl only pages in a specific language.
My first idea was to create a postprocess tool (similar to dedup) that
checks each indexed page and, if it has the wrong lang attribute, deletes
it and removes all its outlinks. You'd run this tool after every indexing
run.
The other idea was to create some kind of filter that discards the page
(and its outlinks) as soon as the language has been identified (in
LanguageIndexingFilter).
Which would be better, and what can I take as my starting point?
Thanks,
Samo Kralj