[jira] [Updated] (NUTCH-208) http: proxy exception list:

Lewis John McGibbney (JIRA) Sat, 03 Nov 2012 08:41:13 -0700

     [ 
https://issues.apache.org/jira/browse/NUTCH-208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lewis John McGibbney updated NUTCH-208:
---------------------------------------

    Attachment: NUTCH-208-2.x.patch
                NUTCH-208.patch

The attached patches address this issue for trunk and 2.x. They have been used 
successfully when crawling from behind a university proxy and with a local 
tinyproxy configuration. The following logs from Nutch trunk confirm this:

{code}
lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk_head/runtime/local$ ./bin/nutch fetch crawldb/segment/20121103152653 -threads 5
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-11-03 15:27:53
Fetcher: segment: crawldb/segment/20121103152653
Using queue mode : byHost
Fetcher: threads: 5
Fetcher: time-out divisor: 2
QueueFeeder finished: total 3 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.heraldscotland.com/
Using queue mode : byHost
fetching http://www.theoatmeal.com/
Using queue mode : byHost
fetching http://www.bbc.co.uk/
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=3
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=3
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
fetch of http://www.bbc.co.uk/ failed with: Http code=403, url=http://www.bbc.co.uk/
fetch of http://www.heraldscotland.com/ failed with: Http code=403, url=http://www.heraldscotland.com/
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-11-03 15:28:05, elapsed: 00:00:11
{code}

Here you can see that both "http://www.bbc.co.uk/" and 
"http://www.heraldscotland.com/" fail with 403s because they are blocked by 
my proxy.

Any comments? One outstanding issue is that there is no JUnit test to 
accompany the patches; I am currently unsure how to implement one.
                
> http: proxy exception list:
> ---------------------------
>
>                 Key: NUTCH-208
>                 URL: https://issues.apache.org/jira/browse/NUTCH-208
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.8, 1.3, nutchgora
>            Reporter: Matthias Günter
>            Assignee: Lewis John McGibbney
>            Priority: Trivial
>              Labels: patch
>             Fix For: 1.6
>
>         Attachments: NUTCH-208-2.x.patch, 
> NUTCH-208-branch-1.4-20110210-v3.patch, NUTCH-208-branch-1.4-20110807.patch, 
> NUTCH-208-branch-1.4-20110809-v2.patch, NUTCH-208.patch, 
> NUTCH-208-trunk-2.0-20110810.patch, NUTCH-208-trunk-2.0-20110810-v2.patch, 
> patch.txt, patch.txt, proxy_exception_list-0.8.diff
>
>
> I suggest that a parameter be added to nutch-default.xml that allows a 
> proxy exception list to be defined:
> <property>
>   <name>http.proxy.exception.list</name>
>   <value></value>
>   <description>URL's and hosts that don't use the proxy (e.g. 
> intranets)</description>
> </property>
> This is useful when scanning intranet/internet combinations from behind a 
> firewall. A preliminary patch showing the changes is attached to this 
> request. We will test it and update it if necessary. This also reflects 
> the reality in web browsers, where in most cases there is an exception 
> list.
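
To illustrate the kind of check the proposed http.proxy.exception.list 
property implies, here is a minimal Java sketch. It is not taken from the 
attached patches: the class name ProxyExceptionSketch, the method useProxy, 
the example hosts, and the assumption that the property value is a 
comma-separated list of host names are all hypothetical. The description 
also mentions URLs, which this sketch does not handle.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch or the attached patches.
public class ProxyExceptionSketch {
    // Lower-cased host names that should be fetched without the proxy.
    private final Set<String> exceptions = new HashSet<>();

    // listValue is ASSUMED to be a comma-separated list of host names.
    public ProxyExceptionSketch(String listValue) {
        if (listValue == null) {
            return;
        }
        for (String entry : listValue.split(",")) {
            String host = entry.trim().toLowerCase(Locale.ROOT);
            if (!host.isEmpty()) {
                exceptions.add(host);
            }
        }
    }

    // Returns false only when the URL's host is on the exception list.
    public boolean useProxy(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return true; // no host to compare, so keep the default behaviour
        }
        return !exceptions.contains(host.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        ProxyExceptionSketch sketch =
            new ProxyExceptionSketch("intranet.example.org, wiki.example.org");
        // Intranet host is on the list, so the proxy is bypassed.
        System.out.println(sketch.useProxy("http://intranet.example.org/index.html"));
        // Public host is not on the list, so the proxy is used.
        System.out.println(sketch.useProxy("http://www.bbc.co.uk/"));
    }
}
```

Keeping the decision in a small method with no network access, as assumed 
here, would also make it straightforward to cover with a unit test.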

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
