[ https://issues.apache.org/jira/browse/NUTCH-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-3056:
---------------------------------
    Description:
We have a case where clients submit huge, uncurated seed files: the host may no longer exist, it may redirect elsewhere through several hops, the protocol may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, except for the regex exceptions listed in db-ignore-external-exemptions. To control the size of the crawl, it is also not allowed to jump to other domains/hosts. This means seeds that redirect externally will not be crawled.

This ticket will add support for a multi-threaded host/domain/protocol/redirect resolver to the injector.

If you have a seed file with 10k+ or millions of records, it is highly recommended to split the input file into chunks so that multiple mappers can share the work. Passing a few million records through one mapper without resolving is no problem, but resolving millions with one mapper, even if threaded, will take many hours.

  was:
We have a case where clients submit huge, uncurated seed files: the host may no longer exist, it may redirect elsewhere through several hops, the protocol may be incorrect, etc.

The large crawl itself is not supposed to venture much beyond the seed list, except for the regex exceptions listed in db-ignore-external-exemptions. To control the size of the crawl, it is also not allowed to jump to other domains/hosts. This means seeds that redirect externally will not be crawled.

This ticket will add support for a multi-threaded host/domain/protocol/redirect resolver to the injector.

> Injector to support resolving seed URLs
> ---------------------------------------
>
>                 Key: NUTCH-3056
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3056
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.21
>
>
> We have a case where clients submit huge, uncurated seed files: the host
> may no longer exist, it may redirect elsewhere through several hops, the
> protocol may be incorrect, etc.
> The large crawl itself is not supposed to venture much beyond the seed list,
> except for the regex exceptions listed in db-ignore-external-exemptions. To
> control the size of the crawl, it is also not allowed to jump to other
> domains/hosts. This means seeds that redirect externally will not be
> crawled.
> This ticket will add support for a multi-threaded
> host/domain/protocol/redirect resolver to the injector.
> If you have a seed file with 10k+ or millions of records, it is highly
> recommended to split the input file into chunks so that multiple mappers can
> share the work. Passing a few million records through one mapper without
> resolving is no problem, but resolving millions with one mapper, even if
> threaded, will take many hours.
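
The recommendation to split a large seed file into chunks could be sketched as follows. This is a minimal Python sketch and not part of the ticket or of Nutch: the function name `split_seed_file`, the `seeds-NNNNN.txt` naming, and the chunk size are hypothetical. It assumes the injector is pointed at a directory of seed files and that each file in that directory is read by at least one mapper.

```python
from pathlib import Path


def split_seed_file(seed_path, out_dir, lines_per_chunk):
    """Split one seed file into files of at most `lines_per_chunk` lines.

    Hypothetical helper for illustration. Lines are copied verbatim, so any
    tab-separated metadata following a URL stays on the same line.
    Returns the list of chunk paths in the order they were written.
    """
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be at least 1")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    chunk_paths = []
    chunk_file = None
    try:
        with open(seed_path, encoding="utf-8") as src:
            for index, line in enumerate(src):
                # Start a new chunk file every `lines_per_chunk` lines.
                if index % lines_per_chunk == 0:
                    if chunk_file is not None:
                        chunk_file.close()
                    path = out / f"seeds-{len(chunk_paths):05d}.txt"
                    chunk_file = open(path, "w", encoding="utf-8")
                    chunk_paths.append(path)
                chunk_file.write(line)
    finally:
        # Close the last chunk even if reading the source fails midway.
        if chunk_file is not None:
            chunk_file.close()
    return chunk_paths
```

Calling, for example, `split_seed_file("seeds.txt", "urls", 50000)` would write `urls/seeds-00000.txt`, `urls/seeds-00001.txt`, and so on; the `urls` directory would then be passed to the injector instead of the single file. A suitable chunk size depends on how many mappers are available and how slow resolving is, neither of which the ticket specifies.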