[
https://issues.apache.org/jira/browse/NUTCH-208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-208:
---------------------------------------
Attachment: NUTCH-208-2.x.patch
NUTCH-208.patch
The attached patches address this issue for trunk and 2.x. They have been used
effectively when crawling from behind a university proxy and a local tinyproxy
configuration. I can confirm this with the following logs from Nutch trunk:
{code}
lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk_head/runtime/local$ ./bin/nutch fetch crawldb/segment/20121103152653 -threads 5
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-11-03 15:27:53
Fetcher: segment: crawldb/segment/20121103152653
Using queue mode : byHost
Fetcher: threads: 5
Fetcher: time-out divisor: 2
QueueFeeder finished: total 3 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.heraldscotland.com/
Using queue mode : byHost
fetching http://www.theoatmeal.com/
Using queue mode : byHost
fetching http://www.bbc.co.uk/
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=3
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=3
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
fetch of http://www.bbc.co.uk/ failed with: Http code=403, url=http://www.bbc.co.uk/
fetch of http://www.heraldscotland.com/ failed with: Http code=403, url=http://www.heraldscotland.com/
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-11-03 15:28:05, elapsed: 00:00:11
{code}
Here you can see that both "http://www.bbc.co.uk/" and
"http://www.heraldscotland.com/" fail with 403s because they are
blocked by my proxy.
Any comments? One outstanding issue is that there is no JUnit test to
accompany the patches; I am currently unsure how to implement one.
> http: proxy exception list:
> ---------------------------
>
> Key: NUTCH-208
> URL: https://issues.apache.org/jira/browse/NUTCH-208
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Affects Versions: 0.8, 1.3, nutchgora
> Reporter: Matthias Günter
> Assignee: Lewis John McGibbney
> Priority: Trivial
> Labels: patch
> Fix For: 1.6
>
> Attachments: NUTCH-208-2.x.patch,
> NUTCH-208-branch-1.4-20110210-v3.patch, NUTCH-208-branch-1.4-20110807.patch,
> NUTCH-208-branch-1.4-20110809-v2.patch, NUTCH-208.patch,
> NUTCH-208-trunk-2.0-20110810.patch, NUTCH-208-trunk-2.0-20110810-v2.patch,
> patch.txt, patch.txt, proxy_exception_list-0.8.diff
>
>
> I suggest adding a parameter to nutch-default.xml that allows a proxy
> exception list to be specified.
> <property>
> <name>http.proxy.exception.list</name>
> <value></value>
> <description>URL's and hosts that don't use the proxy (e.g.
> intranets)</description>
> </property>
> This is useful when scanning intranet/internet combinations from behind a
> firewall. A preliminary patch showing the changes is attached to this
> request. We will test it and update it if necessary. This also
> reflects the reality in web browsers, which in most cases have an
> exception list.
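
For illustration only, the check described in the issue could look roughly like
the sketch below. It is not taken from the attached patches. It assumes that the
http.proxy.exception.list value is a comma-separated list of host names (the
description also mentions URLs, which this sketch does not handle), and the
class and method names (ProxyExceptionSketch, parseExceptions, useProxy) are
hypothetical.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ProxyExceptionSketch {

  // Hypothetical helper: split an assumed comma-separated
  // http.proxy.exception.list value into a set of lower-cased host names.
  public static Set<String> parseExceptions(String value) {
    Set<String> hosts = new HashSet<>();
    if (value == null) {
      return hosts;
    }
    for (String entry : value.split(",")) {
      String host = entry.trim().toLowerCase(Locale.ROOT);
      if (!host.isEmpty()) {
        hosts.add(host);
      }
    }
    return hosts;
  }

  // Returns true when the request should go through the proxy, that is,
  // when the URI's host is not in the exception set.
  public static boolean useProxy(URI uri, Set<String> exceptions) {
    String host = uri.getHost();
    if (host == null) {
      return true; // no host to match against, so fall back to the proxy
    }
    return !exceptions.contains(host.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    // The example.org host names are placeholders, not real configuration.
    Set<String> exceptions =
        parseExceptions("intranet.example.org, wiki.example.org");
    // prints false: the intranet host is on the exception list
    System.out.println(useProxy(URI.create("http://intranet.example.org/"), exceptions));
    // prints true: the host is not listed, so the proxy is used
    System.out.println(useProxy(URI.create("http://www.bbc.co.uk/"), exceptions));
  }
}
```

Because both methods are pure functions of their inputs, a check of this shape
could be covered by a plain unit test without a running proxy.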