localnet) can't be crawled - filtering fails

David Smith (Jira) Wed, 19 Oct 2022 19:06:04 -0700

David Smith created NUTCH-2973:
----------------------------------

             Summary: Single domain names (eg https://localnet) can't be 
crawled - filtering fails
                 Key: NUTCH-2973
                 URL: https://issues.apache.org/jira/browse/NUTCH-2973
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.19
         Environment: Nutch 1.19, checked on Windows 10 and Ubuntu.  Both have 
the same issue.


I'm trying to crawl a SharePoint intranet using Nutch where the URLs are similar 
to:

 

{{https://localnet/something.aspx}}

The issue is that Nutch rejects any URL with a single-element domain name 
such as localnet above. "localnet.com" is not rejected, nor is 
"local.localnet". It almost feels as if there's a chunk of code within Nutch, 
unrelated to the filtering mechanisms, that rejects URLs outright if they 
don't have a WWW-style format and a WWW-style domain such as .COM.

Error message:

 

{{Total urls rejected by filters: 1}}

I've checked and updated all the _filter_ files in the conf directory. Even 
making them incredibly permissive (effectively "crawl everything") has not 
helped.
            Reporter: David Smith


There appears to be a bug within the core of Nutch that fails to permit any 
single domain name URLs to be crawled.  Example:

{{https://{*}localnet{*}/something.aspx}}

The issue is that Nutch rejects any URL with a single-element domain name 
such as *localnet* above. "localnet.com" is not rejected, nor is 
"local.localnet". It almost feels as if there's a chunk of code within Nutch, 
unrelated to the filtering mechanisms, that rejects URLs outright if they 
don't have a WWW-style format and a WWW-style domain such as .COM.
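
To make the observed pattern concrete, here is a minimal Python sketch of the 
kind of check that would produce exactly this behaviour. It is not taken from 
the Nutch source; the function names and the "at least two host labels" rule 
are assumptions used only to illustrate the symptom.

```python
from urllib.parse import urlsplit


def host_labels(url):
    """Return the dot-separated labels of the URL's host (empty list if none)."""
    host = urlsplit(url).hostname or ""
    return [label for label in host.split(".") if label]


def passes_dotted_host_check(url):
    """Hypothetical rule (an assumption, not Nutch code):
    accept only hosts made of at least two labels."""
    return len(host_labels(url)) >= 2


if __name__ == "__main__":
    for url in (
        "https://localnet/something.aspx",
        "https://localnet.com/something.aspx",
        "https://local.localnet/something.aspx",
        "https://blah.localnet/something.aspx",
    ):
        verdict = "accepted" if passes_dotted_host_check(url) else "rejected"
        print(url, verdict)
```

Under this assumed rule only the single-label host is rejected, which matches 
what I see: any filter in the chain that validates the host in this way would 
reject the URL regardless of how permissive the regex rules are.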

Error message:

{{Total urls rejected by filters: 1}}

I've checked and updated all the filter files in the conf directory. Even 
making them incredibly permissive (effectively "crawl everything") has not 
helped. As soon as a dot (.) is added to the domain name, the URL is no 
longer rejected, e.g. blah.localnet.
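
As a sanity check on the "crawl everything" configuration, the sketch below 
emulates a rule list in the style of conf/regex-urlfilter.txt. It assumes each 
rule is a "+" or "-" sign followed by a regex, that the first matching rule 
decides, and that a URL matching no rule is rejected. It is written in Python 
for brevity, so regex details may differ from Java, and the helper name is 
hypothetical.

```python
import re


def first_match_decision(rules, url):
    """Return True if the first rule whose pattern is found in url starts
    with '+', False if it starts with '-' or if no rule matches at all."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    permissive_rules = ["+."]  # accept anything containing at least one character
    print(first_match_decision(permissive_rules, "https://localnet/something.aspx"))
```

A rule list this permissive accepts the single-label URL, which suggests the 
rejection comes from something other than the regex rules themselves.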



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
