[ https://issues.apache.org/jira/browse/NUTCH-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645524#comment-17645524 ]
David Smith commented on NUTCH-2973:
------------------------------------

Thank you, though we have dropped Nutch entirely and gone with Norconex, as Nutch issues aren't being actioned in a timely manner. This was raised on Oct 20 and is only now being discussed, almost 2 months later. Effectively we consider Nutch a dead project.

On Sat, 10 Dec 2022 at 07:32, Sebastian Nagel (Jira) <j...@apache.org>

> Single domain names (eg https://localnet) can't be crawled - filtering fails
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-2973
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2973
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.19
>         Environment: Nutch 1.19, checked on Windows 10 and Ubuntu. Both have
> the same issue.
> I'm trying to crawl a SharePoint intranet using Nutch where the URLs are
> similar to:
>
> {{https://localnet/something.aspx}}
> The issue is that Nutch is rejecting any URL with a single element domain
> name such as localnet above. "localnet.com" is not rejected, nor is
> "local.localnet". It almost feels as if there's a chunk of code within Nutch
> that's unrelated to the filtering mechanisms that rejects URLs outright if
> they don't have a WWW style format and a WWW-style domain such as .COM
> Error message:
>
> {{Total urls rejected by filters: 1}}
> I've checked and updated all the _filter_ files in the conf directory. Even
> making them incredibly permissive (effectively "crawl everything") has not
> helped.
>            Reporter: David Smith
>            Priority: Blocker
>
> There appears to be a bug within the core of Nutch that fails to permit any
> single domain name URLs to be crawled. Example:
> {{https://{*}localnet{*}/something.aspx}}
> The issue is that Nutch is rejecting any URL with a single element domain
> name such as *localnet* above. "localnet.com" is not rejected, nor is
> "local.localnet". It almost feels as if there's a chunk of code within Nutch
> that's unrelated to the filtering mechanisms that rejects URLs outright if
> they don't have a WWW style format and a WWW-style domain such as .COM
> Error message:
> {{Total urls rejected by filters: 1}}
> I've checked and updated all the filter files in the conf directory. Even
> making them incredibly permissive (effectively "crawl everything") has not
> helped. As soon as a dot (.) is added to the domain name, it is no longer
> rejected - eg blah.localnet.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)