Markus Jelsma created NUTCH-3056:
------------------------------------
Summary: Injector to support resolving seed URLs
Key: NUTCH-3056
URL: https://issues.apache.org/jira/browse/NUTCH-3056
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.21
We have a case where clients submit huge, uncurated seed files: the host may no
longer exist, the URL may redirect through several hops to somewhere else, the
protocol may be incorrect, etc.
The large crawl itself is not supposed to venture much beyond the seed list,
except for regex exceptions listed in db-ignore-external-exemptions. To control
the size of the crawl, it is also not allowed to jump to other domains/hosts.
This means externally redirecting seeds will not be crawled.
This ticket will add a multi-threaded host/domain/protocol/redirect resolver to
the injector.
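
To illustrate the idea only, here is a minimal Java sketch of resolving seeds in
parallel before injection. It is not the NUTCH-3056 implementation and uses no
Nutch API: SeedResolverSketch, RedirectLookup, resolve, resolveAll, maxHops and
threads are all hypothetical names, and the network lookup is abstracted behind
an interface so the logic can be exercised without real HTTP requests. It
assumes a seed is dropped when the host is unreachable or the redirect chain
exceeds maxHops.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical lookup: how one URL answers (assumption, not a Nutch interface). */
interface RedirectLookup {
  /**
   * Returns the redirect target of url, or empty if url answers directly.
   * Throws IOException if the host cannot be reached.
   */
  Optional<String> redirectOf(String url) throws IOException;
}

/** Hypothetical sketch of a multi-threaded seed resolver. */
class SeedResolverSketch {

  /** Follows at most maxHops redirects; empty means "drop this seed". */
  static Optional<String> resolve(String seed, RedirectLookup lookup, int maxHops) {
    String current = seed;
    try {
      for (int hop = 0; hop <= maxHops; hop++) {
        Optional<String> next = lookup.redirectOf(current);
        if (next.isEmpty()) {
          return Optional.of(current); // no further redirect: final URL
        }
        current = next.get();
      }
    } catch (IOException e) {
      return Optional.empty(); // host unreachable: drop the seed
    }
    return Optional.empty(); // too many hops, probably a redirect loop
  }

  /** Resolves all seeds on a fixed-size thread pool, keeping input order. */
  static List<String> resolveAll(List<String> seeds, RedirectLookup lookup,
      int maxHops, int threads) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<Optional<String>>> tasks = new ArrayList<>();
      for (String seed : seeds) {
        tasks.add(() -> resolve(seed, lookup, maxHops));
      }
      List<String> resolved = new ArrayList<>();
      // invokeAll blocks until every task is done and returns futures in task order.
      for (Future<Optional<String>> future : pool.invokeAll(tasks)) {
        try {
          future.get().ifPresent(resolved::add);
        } catch (ExecutionException e) {
          // Unexpected failure for a single seed: skip it, keep the rest.
        }
      }
      return resolved;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Fake redirect table standing in for real HTTP responses.
    Map<String, String> redirects = Map.of("http://a.example/", "https://a.example/");
    RedirectLookup lookup = url -> Optional.ofNullable(redirects.get(url));
    System.out.println(resolveAll(List.of("http://a.example/"), lookup, 5, 4));
  }
}
```

A real resolver would replace the fake lookup with actual DNS and HTTP checks,
and would decide which resolved URL (host, domain or protocol variant) ends up
in the CrawlDb; those choices are left open here.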
--
This message was sent by Atlassian Jira
(v8.20.10#820010)