On Tue, Apr 20, 2010 at 7:02 PM, <arkadi.kosmy...@csiro.au> wrote:
> Hi Phil,
>
> > -----Original Message-----
> > From: Phil Barnett [mailto:ph...@philb.us]
> > Sent: Wednesday, 21 April 2010 8:39 AM
> > To: email@example.com
> > Subject: Question about crawler.
> >
> > Is there some place to tell why the crawler has rejected a page? I'm
> > trying to get 1.1 working and basically it doesn't seem to crawl the
> > same way that 1.0 does.
> >
> > I have tika included in the parse- section of conf/nutch-site.xml
> >
> > I have DEBUG set for all the crawl sections, but it doesn't really
> > say why it's rejecting a site.
> >
> > I have the crawler set to not follow external links and I seed the
> > top level of each site.
> >
> > I'm just unclear on how to proceed to troubleshoot this.
>
> Nutch observes robots.txt instructions. Examine the robots.txt files of
> sites that get rejected. The cause may be there. You mention that 1.0
> used to crawl these sites OK. Did you copy its configuration to 1.1?
> In particular, the agent id string?
>
> If this looks OK and the DEBUG output does not tell you enough, you can
> insert extra output in the URL filters involved.
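For anyone following along, the settings being discussed live in conf/nutch-site.xml. A rough sketch of the relevant entries is below; the property names (http.agent.name, db.ignore.external.links, plugin.includes) are standard Nutch configuration keys, but the agent name value and the exact plugin list are illustrative placeholders, not the poster's actual config:

```xml
<!-- illustrative fragment of conf/nutch-site.xml -->

<!-- the agent id string the reply asks about -->
<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value> <!-- placeholder -->
</property>

<!-- "set to not follow external links" -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>

<!-- plugin list with parse-tika included -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

If http.agent.name differs between the 1.0 and 1.1 installs, a robots.txt rule that targets one agent name but not the other could explain crawl differences.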
Yes, I copied the configuration files from 1.0. There are no robots.txt files on these servers. These are internal (WAN) servers, unsophisticated in every way, and the production 1.0 install is still crawling them successfully. I'm not sure why the agent id string would matter, but it is the same. Nobody is blocking me. These are just some home-grown department content servers, nothing special, with no admin watching the web logs or blocking things. But I will double check that there is not a robots.txt file there.
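On why the agent id string can matter: robots.txt rules are matched per user-agent, so a file that allows most crawlers can still disallow one specific agent name. A small offline sketch using Python's standard urllib.robotparser (the rules and agent names here are made up for illustration, not taken from the servers in question):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one agent singled out, everyone else
# allowed except under /private/.
rules = [
    "User-agent: BadBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# The named agent is rejected everywhere...
print(rp.can_fetch("BadBot", "http://server/page.html"))    # False
# ...while any other agent gets the default rules.
print(rp.can_fetch("OtherBot", "http://server/page.html"))  # True
print(rp.can_fetch("OtherBot", "http://server/private/x"))  # False
```

So even an identical agent id between 1.0 and 1.1 only rules this out if robots.txt really is absent, which is what is being double checked above.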