David Smith created NUTCH-2973:
----------------------------------
Summary: Single domain names (eg https://localnet) can't be
crawled - filtering fails
Key: NUTCH-2973
URL: https://issues.apache.org/jira/browse/NUTCH-2973
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.19
Environment: Nutch 1.19, checked on Windows 10 and Ubuntu. Both have
the same issue.
I'm trying to crawl a SharePoint intranet using Nutch, where the URLs are
similar to:
{{https://localnet/something.aspx}}
The issue is that Nutch is rejecting any URL with a single-element domain name
such as localnet above. "localnet.com" is not rejected, nor is
"local.localnet". It almost feels as if there's a chunk of code within Nutch,
unrelated to the filtering mechanisms, that rejects URLs outright if they
don't have a WWW-style format and a WWW-style domain such as .COM.
Error message:
{{Total urls rejected by filters: 1}}
I've checked and updated all the _filter_ files in the conf directory. Even
making them incredibly permissive (effectively "crawl everything") has not
helped.
Reporter: David Smith
There appears to be a bug within the core of Nutch that prevents URLs with a
single-element domain name from being crawled. Example:
{{https://{*}localnet{*}/something.aspx}}
The issue is that Nutch is rejecting any URL with a single-element domain name
such as *localnet* above. "localnet.com" is not rejected, nor is
"local.localnet". It almost feels as if there's a chunk of code within Nutch,
unrelated to the filtering mechanisms, that rejects URLs outright if they
don't have a WWW-style format and a WWW-style domain such as .COM.
Error message:
{{Total urls rejected by filters: 1}}
I've checked and updated all the filter files in the conf directory. Even
making them incredibly permissive (effectively "crawl everything") has not
helped. As soon as a dot (.) is added to the domain name, the URL is no longer
rejected, e.g. blah.localnet.
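
To make the pattern above easier to reproduce, here is a minimal standalone
sketch that separates the three example hosts by whether the host name
contains a dot. This is not Nutch code and does not claim to show where the
rejection happens; the class and method names are hypothetical and only
illustrate the observed behaviour.

```java
import java.net.URI;
import java.util.List;

public class SingleLabelHostCheck {

    // Hypothetical helper (not part of Nutch): returns true when the URL's
    // host consists of a single element, i.e. contains no dot.
    public static boolean isSingleLabelHost(String url) {
        String host = URI.create(url).getHost();
        return host != null && !host.contains(".");
    }

    public static void main(String[] args) {
        // The three hosts mentioned in this report.
        List<String> urls = List.of(
            "https://localnet/something.aspx",
            "https://localnet.com/something.aspx",
            "https://local.localnet/something.aspx");

        for (String url : urls) {
            System.out.println(url + " -> single-element host: "
                + isSingleLabelHost(url));
        }
    }
}
```

Only the first URL is flagged as having a single-element host, which matches
the one URL that Nutch reports as rejected by filters.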