Hi Phil,

> -----Original Message-----
> From: Phil Barnett [mailto:ph...@philb.us]
> Sent: Wednesday, 21 April 2010 8:39 AM
> To: nutch-user@lucene.apache.org
> Subject: Question about crawler.
> Is there some place to tell why the crawler has rejected a page? I'm
> trying to get 1.1 working and basically it doesn't seem to crawl the
> same way that 1.0 does.
> I have tika included in the parse- section of conf/nutch-site.xml
> I have DEBUG set for all the crawl sections, but it doesn't really say
> why it's rejecting a site.
> I have the crawler set to not follow external links and I seed the top
> level of each site.
> I'm just unclear on how to proceed to troubleshoot this.

Nutch observes robots.txt instructions. Examine the robots.txt files of the 
sites that get rejected; the cause may be there. You mention that 1.0 used to 
crawl these sites OK. Did you copy its configuration to 1.1? In particular, 
check the agent id settings (http.agent.name and related properties): the 
fetcher will refuse to run if no agent name is configured.

If this looks OK and the DEBUG output does not tell you enough, you can insert 
extra logging output into the URL filters involved.
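For orientation, Nutch's regex URL filter behaves roughly like the sketch 
below: rules are tried in order, a leading '+' accepts, '-' rejects, and the 
first matching rule decides (the rules shown are illustrative, not Nutch's 
defaults). Printing which rule fired is exactly the kind of extra output that 
helps when a seed URL silently disappears:

```python
import re

# Illustrative rules, not Nutch's shipped defaults:
# reject common static-asset suffixes, accept plain http URLs.
rules = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$")),
    ("+", re.compile(r"^http://")),
]

def filter_url(url):
    """Return True if the URL passes; log which rule decided."""
    for sign, rx in rules:
        if rx.search(url):
            print(f"{sign} rule {rx.pattern!r} matched {url}")
            return sign == "+"
    print(f"no rule matched {url}; rejected")
    return False

print(filter_url("http://example.com/logo.png"))    # rejected by '-' rule
print(filter_url("http://example.com/index.html"))  # accepted by '+' rule
```

In the real plugin the same idea applies: a temporary log line at the point 
where a rule matches will tell you which pattern in regex-urlfilter.txt is 
eating your URLs.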
