Dennis, thanks for the note. It turns out that I am crawling my intranet
so firewalls aren't an issue... I did do some further investigation and
I discovered that high port numbers aren't the problem, either... 

The *problem* is with an interaction with the Roller blogging software.
When the Nutch fetcher crawls to the site, I get the following error:

--
2007-12-04 10:47:02,689 INFO  fetcher.Fetcher - fetch of
http://blogs.wrc.xerox.com:8080/roller failed with:
java.lang.NullPointerException
--

I requested the same URL with "curl" and saw that Roller redirects the
HTTP client to the same URL but with a "/" at the end:

----
HTTP/1.0 302 Moved Temporarily
Connection: close
Server: Apache-Coyote/1.1
Location: http://blogs.wrc.xerox.com:8080/roller/
Date: Tue, 04 Dec 2007 19:14:54 GMT
----
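
For anyone who wants to repeat this check without curl, here is a minimal
Python sketch. It is not part of Nutch; the function names are my own, and
it assumes a plain-HTTP server. http.client does not follow redirects, so
the 302 and its Location header can be inspected directly.

```python
import http.client
from urllib.parse import urlsplit


def head_without_redirects(url, timeout=10):
    """Send a single HEAD request and return (status, Location or None).

    http.client never follows redirects, so a 302 is returned as-is.
    Assumes a plain http:// URL (no TLS).
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80,
                                      timeout=timeout)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def is_trailing_slash_redirect(url, status, location):
    """True when a 3xx response points at the same URL plus a final '/'."""
    return 300 <= status < 400 and location == url + "/"


# Example usage (needs network access to the server, so it is left
# commented out):
# status, location = head_without_redirects(
#     "http://blogs.wrc.xerox.com:8080/roller")
# print(status, location,
#       is_trailing_slash_redirect(
#           "http://blogs.wrc.xerox.com:8080/roller", status, location))
```

The second helper only classifies a response that has already been fetched,
so it can be tried against the header values shown above without touching
the server.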

Somehow this is causing the Fetcher to throw an exception. For now, I
am going to seed my crawler with the re-directed URL but somebody might
want to look at this....
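
In case it helps anyone applying the same workaround to a longer seed
list, here is a rough Python sketch of the rewrite step. It assumes you
have already collected the Location header for each redirecting seed; the
function names and the "redirects" mapping are hypothetical, not anything
Nutch provides.

```python
from urllib.parse import urljoin


def seed_after_redirect(url, location):
    """Resolve a Location header against the URL that produced it.

    urljoin handles both absolute targets and relative ones such as
    "/roller/".
    """
    return urljoin(url, location)


def rewrite_seeds(seeds, redirects):
    """Return the seed list with each redirecting URL replaced by its target.

    `redirects` maps a seed URL to the Location header it returned.
    Seeds without an entry are kept unchanged.
    """
    return [
        seed_after_redirect(url, redirects[url]) if url in redirects else url
        for url in seeds
    ]


seeds = ["http://blogs.wrc.xerox.com:8080/roller"]
redirects = {
    "http://blogs.wrc.xerox.com:8080/roller":
        "http://blogs.wrc.xerox.com:8080/roller/",
}
for seed in rewrite_seeds(seeds, redirects):
    print(seed)
```

This only follows one hop per seed. A chain of redirects would need the
check repeated on the rewritten URL.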

 -Lee


-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 03, 2007 1:43 PM
To: [email protected]
Subject: Re: crawing for content on port 8080

Are there errors showing up in the log file while fetching?  My guess 
would be some type of firewall preventing outbound connections on 
anything but port 80.  But if this is the case then timeout errors 
should be showing up in the logs.

Dennis Kubes

Moore, Lee C wrote:
> Hello,
>  
> I would like to have Nutch crawl web sites that are on ports other than
> port 80.  So, I changed the regex-urlfilter.txt file so that it would
> allow an optional port number on the URL.  I see the URLs with high
> ports show up as candidates from my seed list but they aren't actually
> fetched.  Would anybody be able to help me to understand what I might
> do?
>  
> thanks,
>  
>  Lee
>  
> 
