I had the same issue.
You need to use a tool like http://java-ntlm-proxy.sourceforge.net/ to
get past the proxy.
You will have to edit its configuration file to add your proxy server's
hostname, port, login and password.
Then you need to configure your Nutch process to point to this NTLM proxy
process. You should add the following to nutch-site.xml:
<property>
<name>http.proxy.host</name>
<value>hostname of the machine where the NTLMProxy is located</value>
<description>The proxy hostname. If empty, no proxy is
used.</description>
</property>
<property>
<name>http.proxy.port</name>
<value>port of the NTLMProxy process </value>
<description>The proxy port.</description>
</property>
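Before pointing Nutch at the relay, it may be worth checking that the relay answers at all. Below is a minimal sketch in plain Java (not Nutch code) that sends one request through an HTTP proxy and prints the status code. The class name ProxyCheckSketch and its methods are made up for illustration, and the host, port and URL you pass in are whatever you configured above.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

// Hypothetical helper for illustration only; not part of Nutch or java-ntlm-proxy.
class ProxyCheckSketch {

    // Build an HTTP proxy descriptor for the machine running the NTLM relay.
    static Proxy proxyFor(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
    }

    // Send one request through the proxy and return the HTTP status code.
    // The target host name is handed to the proxy, which does the lookup itself.
    static int statusVia(Proxy proxy, String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection(proxy);
        conn.setConnectTimeout(10_000); // same 10 s value as http.timeout in the log
        conn.setReadTimeout(10_000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: ProxyCheckSketch <proxyHost> <proxyPort> <url>");
            return;
        }
        Proxy proxy = proxyFor(args[0], Integer.parseInt(args[1]));
        System.out.println("HTTP status: " + statusVia(proxy, args[2]));
    }
}
```

If this prints a status code for a page on the site you want to crawl, the relay itself is probably fine and the remaining problem is on the Nutch side.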
I also suggest adding this property to avoid any hostname resolution
conflicts:
<property>
<name>fetcher.threads.per.host.by.ip</name>
<value>false</value>
<description>If false, fetcher threads are counted per host name rather
than per resolved IP address.</description>
</property>
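To show why this setting matters behind a proxy, here is a rough sketch of the two ways a per-host key can be computed. It is my own simplification, not the actual Nutch source; the class and method names are hypothetical. The stack trace in the original message shows HttpBase.blockAddr calling InetAddress.getByName, which is the kind of local lookup the by-IP branch below performs.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical illustration only; this is not the Nutch implementation.
class HostKeySketch {

    // Return the key used to group requests to the same server.
    static String hostKey(String host, boolean byIp) {
        if (byIp) {
            try {
                // Needs working local DNS; behind a proxy-only network this
                // is where an UnknownHostException would surface.
                return InetAddress.getByName(host).getHostAddress();
            } catch (UnknownHostException e) {
                throw new IllegalStateException("cannot resolve " + host, e);
            }
        }
        // No lookup at all: the lower-cased host name is the key.
        return host.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println("by name: " + hostKey(host, false));
        System.out.println("by IP:   " + hostKey(host, true));
    }
}
```

With the by-name key the crawling machine never has to resolve the target host itself, so the lookup is left to the proxy.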
Hope this helps.
I tried to run Nutch 0.9 from my network, which requires HTTP proxy access.
I have set the http.proxy.host and http.proxy.port properties in my
nutch-site.xml. The proxy does not require authorization. Nutch picks the
settings up, as I can see in the log (see below), but I still get
java.net.UnknownHostException.
Interestingly, I used Wireshark (or Ethereal) to check whether Nutch really
tries to use the proxy. There is a request from Nutch to the proxy for
robots.txt, and the answer is "404 Not Found". No request follows for the
particular page, only the one for robots.txt.
Any ideas what is wrong?
Marcin Okraszewski
2007-05-15 17:38:59,465 INFO http.Http - http.proxy.host = <my_proxy_host>
2007-05-15 17:38:59,465 INFO http.Http - http.proxy.port =
<my_proxy_port>
2007-05-15 17:38:59,465 INFO http.Http - http.timeout = 10000
2007-05-15 17:38:59,465 INFO http.Http - http.content.limit = 65536
2007-05-15 17:38:59,465 INFO http.Http - http.agent =
YetAnotherSearchEngine/Nutch-0.9
2007-05-15 17:38:59,465 INFO http.Http - protocol.plugin.check.blocking =
true
2007-05-15 17:38:59,465 INFO http.Http - protocol.plugin.check.robots =
true
2007-05-15 17:38:59,466 INFO http.Http - fetcher.server.delay = 100
2007-05-15 17:38:59,466 INFO http.Http - http.max.delays = 100
2007-05-15 17:38:59,832 ERROR http.Http -
org.apache.nutch.protocol.http.api.HttpException:
java.net.UnknownHostException: <crawl_site>: <crawl_site>
2007-05-15 17:38:59,832 ERROR http.Http - at
org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:340)
2007-05-15 17:38:59,832 ERROR http.Http - at
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(
HttpBase.java:212)
2007-05-15 17:38:59,832 ERROR http.Http - at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
2007-05-15 17:38:59,832 ERROR http.Http - Caused by:
java.net.UnknownHostException: www.gral.pl: www.gral.pl
2007-05-15 17:38:59,832 ERROR http.Http - at
java.net.InetAddress.getAllByName0(InetAddress.java:1128)
2007-05-15 17:38:59,833 ERROR http.Http - at
java.net.InetAddress.getAllByName0(InetAddress.java:1098)
2007-05-15 17:38:59,833 ERROR http.Http - at
java.net.InetAddress.getAllByName(InetAddress.java:1061)
2007-05-15 17:38:59,833 ERROR http.Http - at
java.net.InetAddress.getByName(InetAddress.java:958)
2007-05-15 17:38:59,833 ERROR http.Http - at
org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:336)
2007-05-15 17:38:59,833 ERROR http.Http - ... 2 more
2007-05-15 17:38:59,834 INFO fetcher.Fetcher - fetch of <crawl_site>
failed with: org.apache.nutch.protocol.http.api.HttpException:
java.net.UnknownHostException: <crawl_site>: <crawl_site>