Marcin Okraszewski wrote:
I tried to run Nutch 0.9 from my network, which requires HTTP proxy access. I
have set the http.proxy.host and http.proxy.port properties in my
nutch-site.xml. The proxy does not require authorization. Nutch picks the
settings up, as I can see in the log (see below), but I still get
java.net.UnknownHostException.
Interestingly, I used Wireshark (or Ethereal) to check whether Nutch really tries
to use the proxy. There is a request from Nutch to the proxy for robots.txt, and the
response is "404 Not Found". There is no following request for the particular page,
only the one for robots.txt.
Any ideas what is wrong?
IIRC we had to patch Nutch to make it work with a proxy, but that was
Nutch 0.8 and I don't have the code available right now. You might
want to search JIRA for possible patches. Actually, it seems that
something has been done about this:
http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt
issues 21
HTH
Michael
Marcin Okraszewski
2007-05-15 17:38:59,465 INFO http.Http - http.proxy.host = <my_proxy_host>
2007-05-15 17:38:59,465 INFO http.Http - http.proxy.port = <my_proxy_port>
2007-05-15 17:38:59,465 INFO http.Http - http.timeout = 10000
2007-05-15 17:38:59,465 INFO http.Http - http.content.limit = 65536
2007-05-15 17:38:59,465 INFO http.Http - http.agent =
YetAnotherSearchEngine/Nutch-0.9
2007-05-15 17:38:59,465 INFO http.Http - protocol.plugin.check.blocking = true
2007-05-15 17:38:59,465 INFO http.Http - protocol.plugin.check.robots = true
2007-05-15 17:38:59,466 INFO http.Http - fetcher.server.delay = 100
2007-05-15 17:38:59,466 INFO http.Http - http.max.delays = 100
2007-05-15 17:38:59,832 ERROR http.Http - org.apache.nutch.protocol.http.api.HttpException:
java.net.UnknownHostException: <crawl_site>: <crawl_site>
2007-05-15 17:38:59,832 ERROR http.Http - at
org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:340)
2007-05-15 17:38:59,832 ERROR http.Http - at
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:212)
2007-05-15 17:38:59,832 ERROR http.Http - at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
2007-05-15 17:38:59,832 ERROR http.Http - Caused by:
java.net.UnknownHostException: www.gral.pl: www.gral.pl
2007-05-15 17:38:59,832 ERROR http.Http - at
java.net.InetAddress.getAllByName0(InetAddress.java:1128)
2007-05-15 17:38:59,833 ERROR http.Http - at
java.net.InetAddress.getAllByName0(InetAddress.java:1098)
2007-05-15 17:38:59,833 ERROR http.Http - at
java.net.InetAddress.getAllByName(InetAddress.java:1061)
2007-05-15 17:38:59,833 ERROR http.Http - at
java.net.InetAddress.getByName(InetAddress.java:958)
2007-05-15 17:38:59,833 ERROR http.Http - at
org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:336)
2007-05-15 17:38:59,833 ERROR http.Http - ... 2 more
2007-05-15 17:38:59,834 INFO fetcher.Fetcher - fetch of <crawl_site> failed with:
org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException:
<crawl_site>: <crawl_site>
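
Going by the trace above, the UnknownHostException is thrown by
InetAddress.getByName, called from HttpBase.blockAddr. That is a local DNS
lookup of the crawl host, and it may well fail on a network where only the
proxy can resolve external names. A minimal sketch to check this outside
Nutch, assuming that is the cause (the class ResolveCheck and the method
canResolveLocally are hypothetical names, not Nutch code):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Nutch: checks whether this machine
// can resolve a host name without going through the HTTP proxy.
public class ResolveCheck {

    // Returns true if the local resolver maps the host to an address,
    // false if the lookup ends in UnknownHostException.
    static boolean canResolveLocally(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the crawl host as the first argument; the default is a
        // placeholder name under the reserved .invalid TLD.
        String host = args.length > 0 ? args[0] : "example.invalid";
        System.out.println(host + " resolvable locally: " + canResolveLocally(host));
    }
}
```

If this reports false for the crawl host on the machine running Nutch, the
failure would happen before the proxy is used for the page itself, which
would fit the log.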
--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
+41 44 272 91 61