Dennis, thanks for the note. It turns out that I am crawling my intranet so firewalls aren't an issue... I did do some further investigation and I discovered that high port numbers aren't the problem, either...
The *problem* is an interaction with the Roller blogging software. When the Nutch fetcher crawls the site, I get the following error:

--
2007-12-04 10:47:02,689 INFO fetcher.Fetcher - fetch of http://blogs.wrc.xerox.com:8080/roller failed with: java.lang.NullPointerException
--

I requested the same URL with "curl" and saw that Roller redirects the HTTP client to the same URL but with a "/" at the end:

----
HTTP/1.0 302 Moved Temporarily
Connection: close
Server: Apache-Coyote/1.1
Location: http://blogs.wrc.xerox.com:8080/roller/
Date: Tue, 04 Dec 2007 19:14:54 GMT
----

Somehow this redirect is causing the Fetcher to throw an exception. For now, I am going to seed my crawler with the redirected URL, but somebody might want to look at this.

-Lee

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Monday, December 03, 2007 1:43 PM
To: [email protected]
Subject: Re: crawing for content on port 8080

Are there errors showing up in the log file while fetching? My guess would be some type of firewall preventing outbound connections on anything but port 80. But if this is the case, then timeout errors should be showing up in the logs.

Dennis Kubes

Moore, Lee C wrote:
> Hello,
>
> I would like to have Nutch crawl web sites that are on ports other than
> port 80. So, I changed the regex-urlfilter.txt file so that it would
> allow an optional port number on the URL. I see the URLs with high
> ports show up as candidates from my seed list but they aren't actually
> fetched. Would anybody be able to help me to understand what I might
> do?
>
> thanks,
>
> Lee
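
For anyone who wants to apply the same workaround (seeding with the redirected URL) to a longer seed list, here is a minimal sketch in Python. It is not part of Nutch. The function names redirected_seed and probe are hypothetical, it assumes plain-http seed URLs, and it looks at only one redirect hop per URL.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Hypothetical helpers for illustration only; neither is a Nutch API.
REDIRECT_CODES = (301, 302, 303, 307, 308)


def redirected_seed(url, status, location):
    """Return the URL to seed with, given one HTTP response to `url`."""
    if status in REDIRECT_CODES and location:
        # Location may be relative, so resolve it against the requested URL.
        return urljoin(url, location)
    return url


def probe(url, timeout=10):
    """Send one HEAD request without following redirects.

    Assumes a plain-http URL. Returns (status, Location header or None).
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


if __name__ == "__main__":
    # Values copied from the curl output in this message; no network access needed.
    seed = "http://blogs.wrc.xerox.com:8080/roller"
    print(redirected_seed(seed, 302, "http://blogs.wrc.xerox.com:8080/roller/"))
```

To rewrite a real seed list, one would call redirected_seed(url, *probe(url)) for each URL and write the results to the seed file. This only sidesteps the NullPointerException; it does not explain why the Fetcher fails on the 302.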
