[ 
https://issues.apache.org/jira/browse/NUTCH-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18028418#comment-18028418
 ] 

Isabelle Giguere commented on NUTCH-3099:
-----------------------------------------

Hi, [~lewismc]
It turns out I had time to look into this.  Patch attached, including unit test.

I tested crawling https://nutch.apache.org using tinyproxy:
- exact match (as currently supported)
- prefix '*' : "*.apache.org"
- suffix '*' : "nutch.*"
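
For illustration only (this is not the attached patch), the three cases above could be sketched roughly as follows. The class and method names are hypothetical, and the sketch assumes case-insensitive host comparison and that '*' is only honored as the first or last character of an entry:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not the code in the attached patch.
public class ProxyExceptionSketch {

  /**
   * Returns true if the host matches one exception entry.
   * Assumption: '*' is only meaningful as the first or last character.
   */
  static boolean matches(String pattern, String host) {
    String p = pattern.trim().toLowerCase(Locale.ROOT);
    String h = host.trim().toLowerCase(Locale.ROOT);
    if (p.isEmpty()) {
      return false;
    }
    if (p.startsWith("*")) {
      // prefix '*': "*.apache.org" matches any host ending in ".apache.org"
      return h.endsWith(p.substring(1));
    }
    if (p.endsWith("*")) {
      // suffix '*': "nutch.*" matches any host starting with "nutch."
      return h.startsWith(p.substring(0, p.length() - 1));
    }
    // no wildcard: exact match
    return h.equals(p);
  }

  /** Returns true if any entry in the exception list matches the host. */
  static boolean bypassesProxy(List<String> patterns, String host) {
    for (String pattern : patterns) {
      if (matches(pattern, host)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> exceptions = Arrays.asList("*.apache.org", "nutch.*", "localhost");
    for (String host : new String[] {"nutch.apache.org", "solr.apache.org", "github.com"}) {
      String route = bypassesProxy(exceptions, host) ? "direct" : "via proxy";
      System.out.println(host + " -> " + route);
    }
  }
}
```

With entries like these, only hosts that match no entry would be expected to show up in the proxy logs.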

Note that, depending on crawl options and configuration, a crawl of 
https://nutch.apache.org can reach pages on other hosts, so the tinyproxy 
logs can show activity for those hosts as well (e.g. solr.apache.org, 
tika.apache.org, ci-builds.apache.org, en.wikipedia.org, github.com, 
www.elastic.co).

> Allow wildcard '*' in http.proxy.exception.list
> -----------------------------------------------
>
>                 Key: NUTCH-3099
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3099
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: 1.20
>            Reporter: Isabelle Giguere
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.22
>
>         Attachments: NUTCH-3099.2025-10-08.patch.txt
>
>
> The Nutch setting "http.proxy.exception.list" should accept the '*' wildcard.
> The equivalent JVM property "http.nonProxyHosts" does allow '*' at the start 
> or end of a host name.
> https://docs.oracle.com/javase/8/docs/technotes/guides/net/proxies.html
> Note that starting Nutch with -Dhttp.nonProxyHosts="some.host" has no effect: 
> crawling still goes through the proxy.  Only "http.proxy.exception.list" 
> can be used with Nutch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
