[
https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855579#comment-17855579
]
Markus Jelsma commented on NUTCH-3056:
--------------------------------------
Initial 1.15 patch.
Set db.injector.resolve.urls to true to enable the injector's resolver, and
use db.injector.resolve.num.threads to control the number of resolver
threads. It defaults to 50.
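Assuming the patch follows Nutch's usual property conventions, enabling the resolver in nutch-site.xml might look like this (property names are taken from the comment above; the descriptions are illustrative, not from the patch):

```xml
<!-- nutch-site.xml: enable the injector's seed resolver (sketch) -->
<property>
  <name>db.injector.resolve.urls</name>
  <value>true</value>
  <description>Whether the injector resolves seed URLs before injecting them.</description>
</property>
<property>
  <name>db.injector.resolve.num.threads</name>
  <value>50</value>
  <description>Number of resolver threads per mapper (default: 50).</description>
</property>
```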
> Injector to support resolving seed URLs
> ---------------------------------------
>
> Key: NUTCH-3056
> URL: https://issues.apache.org/jira/browse/NUTCH-3056
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3056.patch
>
>
> We have a case where clients submit huge uncurated seed files: the host may
> no longer exist, the URL may redirect through several hops to elsewhere, the
> protocol may be incorrect, etc.
> The large crawl itself is not supposed to venture much beyond the seed list,
> except for the regex exemptions listed in db-ignore-external-exemptions. To
> control the size of the crawl, it is also not allowed to jump to other
> domains/hosts. This means externally redirecting seeds will not be crawled.
> This ticket will add support for a multi-threaded
> host/domain/protocol/redirect resolver in the injector. Seeds that do not
> resolve to a 200 URL will be discarded. Enabling filtering and normalization
> is highly recommended for handling the redirects.
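A minimal sketch of the resolver idea described above: a fixed-size thread pool resolves each seed and keeps only those whose final status is 200. The class name, the pluggable status fetcher, and all identifiers are hypothetical, not from the actual NUTCH-3056 patch; a real resolver would issue HTTP requests and follow redirect chains instead of a lookup function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class SeedResolver {
    private final int numThreads;
    // Pluggable: returns the final HTTP status after following redirects.
    private final Function<String, Integer> statusFetcher;

    public SeedResolver(int numThreads, Function<String, Integer> statusFetcher) {
        this.numThreads = numThreads;
        this.statusFetcher = statusFetcher;
    }

    /** Resolve all seeds in parallel; keep only those that end in a 200. */
    public List<String> resolve(List<String> seeds) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<Future<String>> futures = new ArrayList<>();
        for (String seed : seeds) {
            futures.add(pool.submit(
                () -> statusFetcher.apply(seed) == 200 ? seed : null));
        }
        List<String> kept = new ArrayList<>();
        for (Future<String> f : futures) {
            try {
                String s = f.get();
                if (s != null) kept.add(s);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            } catch (ExecutionException e) {
                // Unresolvable seed (DNS failure, timeout, ...): discard it.
            }
        }
        pool.shutdown();
        return kept;
    }

    public static void main(String[] args) {
        // Fake status lookup for demonstration purposes only.
        Map<String, Integer> statuses = Map.of(
            "http://ok.example/", 200,
            "http://gone.example/", 404);
        SeedResolver r = new SeedResolver(4,
            u -> statuses.getOrDefault(u, 500));
        List<String> kept = r.resolve(List.of(
            "http://ok.example/", "http://gone.example/",
            "http://dead.example/"));
        System.out.println(kept); // [http://ok.example/]
    }
}
```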
> If you have a seed file with 10k+ or millions of records, it is highly
> recommended to split the input file into chunks so that multiple mappers can
> get to work. Passing a few million records through one mapper without
> resolving is no problem, but resolving millions with one mapper, even if
> threaded, will take many hours.
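The splitting recommended above can be done with the standard `split` utility before handing the seed directory to the injector. The file and chunk names below are illustrative; the demo seed list is generated on the spot:

```shell
# Generate a demo seed list, then split it into fixed-size chunks so a
# Hadoop injector job can assign one mapper per chunk.
mkdir -p seeds
seq 1 400000 | sed 's|^|http://example.org/page|' > seeds.txt
split -l 100000 -d seeds.txt seeds/part-   # -> seeds/part-00 .. part-03
ls seeds/
```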
--
This message was sent by Atlassian Jira
(v8.20.10#820010)