[ 
https://issues.apache.org/jira/browse/NUTCH-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645524#comment-17645524
 ] 

David Smith commented on NUTCH-2973:
------------------------------------

Thank you, though we have dropped Nutch entirely and gone with Norconex as
Nutch issues aren't being actioned in a timely manner.  This was raised on
Oct 20 and is only now being discussed almost 2 months later.  Effectively
we consider Nutch a dead project.

On Sat, 10 Dec 2022 at 07:32, Sebastian Nagel (Jira) <j...@apache.org>
wrote:

> Single domain names (eg https://localnet) can't be crawled - filtering fails
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-2973
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2973
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.19
>         Environment: Nutch 1.19, checked on Windows 10 and Ubuntu.  Both have 
> the same issue. 
> I'm trying to crawl a SharePoint intranet using Nutch where the URLs are 
> similar to:
>  
> {{https://localnet/something.aspx}}
> The issue is that Nutch is rejecting any url with a single element domain 
> name such as localnet above. "localnet.com" is not rejected, nor is 
> "local.localnet". It almost feels as if there's a chunk of code within Nutch 
> that's unrelated to the filtering mechanisms that rejects URLs outright if 
> they don't have a WWW-style format and a WWW-style domain such as .COM.
> Error message:
>  
> {{Total urls rejected by filters: 1}}
> I've checked and updated all the _filter_ files in the conf directory. Even 
> making them incredibly permissive (effectively "crawl everything") has not 
> helped.
>            Reporter: David Smith
>            Priority: Blocker
>
> There appears to be a bug within the core of Nutch that fails to permit any 
> single domain name URLs to be crawled.  Example:
> {{https://localnet/something.aspx}}
> The issue is that Nutch is rejecting any url with a single element domain 
> name such as *localnet* above. "localnet.com" is not rejected, nor is 
> "local.localnet". It almost feels as if there's a chunk of code within Nutch 
> that's unrelated to the filtering mechanisms that rejects URLs outright if 
> they don't have a WWW-style format and a WWW-style domain such as .COM.
> Error message:
> {{Total urls rejected by filters: 1}}
> I've checked and updated all the filter files in the conf directory. Even 
> making them incredibly permissive (effectively "crawl everything") has not 
> helped.  As soon as a dot (.) is added to the domain name, the URL is no 
> longer rejected, e.g. blah.localnet.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
