[
https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855579#comment-17855579
]
Markus Jelsma commented on NUTCH-3056:
--------------------------------------
Initial 1.15 patch.
Set db.injector.resolve.urls to true to enable the injector's resolver, and
use db.injector.resolve.num.threads to control the number of resolver
threads. It defaults to 50.
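Assuming the patch follows Nutch's usual property conventions, enabling the resolver in nutch-site.xml might look like this (property names are taken from the comment above; the descriptions are illustrative, not from the patch):

```xml
<!-- nutch-site.xml: enable the injector's seed resolver (sketch) -->
<property>
  <name>db.injector.resolve.urls</name>
  <value>true</value>
  <description>Whether the injector resolves seed URLs before injecting them.</description>
</property>
<property>
  <name>db.injector.resolve.num.threads</name>
  <value>50</value>
  <description>Number of resolver threads per mapper (default: 50).</description>
</property>
```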
> Injector to support resolving seed URLs
> ---------------------------------------
>
> Key: NUTCH-3056
> URL: https://issues.apache.org/jira/browse/NUTCH-3056
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3056.patch
>
>
> We have a case where clients submit huge uncurated seed files: the host may
> no longer exist, the URL may redirect through several hops to elsewhere, the
> protocol may be incorrect, etc.
> The large crawl itself is not supposed to venture much beyond the seed list,
> except for the regex exemptions listed in db-ignore-external-exemptions. To
> control the size of the crawl, it is also not allowed to jump to other
> domains/hosts. This means externally redirecting seeds will not be crawled.
> This ticket will add support for a multi-threaded
> host/domain/protocol/redirect resolver in the injector. Seeds that do not
> resolve to a 200 URL will be discarded. Enabling filtering and normalization
> is highly recommended for handling the redirects.
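A minimal sketch of the resolver idea described above: a fixed-size thread pool resolves each seed and keeps only those whose final status is 200. The class name, the pluggable status fetcher, and all identifiers are hypothetical, not from the actual NUTCH-3056 patch; a real resolver would issue HTTP requests and follow redirect chains instead of a lookup function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class SeedResolver {
    private final int numThreads;
    // Pluggable: returns the final HTTP status after following redirects.
    private final Function<String, Integer> statusFetcher;

    public SeedResolver(int numThreads, Function<String, Integer> statusFetcher) {
        this.numThreads = numThreads;
        this.statusFetcher = statusFetcher;
    }

    /** Resolve all seeds in parallel; keep only those that end in a 200. */
    public List<String> resolve(List<String> seeds) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<Future<String>> futures = new ArrayList<>();
        for (String seed : seeds) {
            futures.add(pool.submit(
                () -> statusFetcher.apply(seed) == 200 ? seed : null));
        }
        List<String> kept = new ArrayList<>();
        for (Future<String> f : futures) {
            try {
                String s = f.get();
                if (s != null) kept.add(s);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (ExecutionException e) {
                // Unresolvable seed (DNS failure, timeout, ...): discard it.
            }
        }
        pool.shutdown();
        return kept;
    }

    public static void main(String[] args) {
        // Fake status lookup for demonstration purposes only.
        Map<String, Integer> statuses = Map.of(
            "http://ok.example/", 200,
            "http://gone.example/", 404);
        SeedResolver r = new SeedResolver(4,
            u -> statuses.getOrDefault(u, 500));
        List<String> kept = r.resolve(List.of(
            "http://ok.example/", "http://gone.example/",
            "http://dead.example/"));
        System.out.println(kept); // [http://ok.example/]
    }
}
```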
> If you have a seed file with 10k+ or millions of records, it is highly
> recommended to split the input file into chunks so that multiple mappers can
> get to work. Passing a few million records through one mapper without
> resolving is no problem, but resolving millions with one mapper, even if
> threaded, will take many hours.
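The splitting recommended above can be done with the standard `split` utility before handing the seed directory to the injector. The file and chunk names below are illustrative; the demo seed list is generated on the spot:

```shell
# Generate a demo seed list, then split it into fixed-size chunks so a
# Hadoop injector job can assign one mapper per chunk.
mkdir -p seeds
seq 1 400000 | sed 's|^|http://example.org/page|' > seeds.txt
split -l 100000 -d seeds.txt seeds/part-   # -> seeds/part-00 .. part-03
ls seeds/
```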
--
This message was sent by Atlassian Jira
(v8.20.10#820010)