Hi Phil,

> -----Original Message-----
> From: Phil Barnett [mailto:ph...@philb.us]
> Sent: Wednesday, 21 April 2010 8:39 AM
> To: nutch-user@lucene.apache.org
> Subject: Question about crawler.
> 
> Is there some place to tell why the crawler has rejected a page? I'm
> trying to get 1.1 working and basically it doesn't seem to crawl the
> same way that 1.0 does.
> 
> I have tika included in the parse- section of conf/nutch-site.xml
> 
> I have DEBUG set for all the crawl sections, but it doesn't really say
> why it's rejecting a site.
> 
> I have the crawler set to not follow external links and I seed the top
> level of each site.
> 
> I'm just unclear on how to proceed to troubleshoot this.

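Nutch obeys robots.txt directives, so first examine the robots.txt files of
the sites that get rejected; the cause may be there. You mention that 1.0
crawled these sites OK. Did you copy its configuration to 1.1, in particular
the agent ID string? A robots.txt file can apply different rules to
different agents, so a changed agent string can change what is allowed.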
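
To illustrate how the agent string can matter, here is a minimal sketch in
Python. It uses Python's standard urllib.robotparser, not Nutch's own
robots.txt handling, so treat the result as an approximation. The agent
names "oldbot" and "newbot", the sample robots.txt, and the example.com URL
are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt: one named agent is allowed everywhere,
# every other agent is disallowed everywhere.
SAMPLE_ROBOTS = """\
User-agent: oldbot
Disallow:

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    url = "http://example.com/index.html"
    for agent in ("oldbot", "newbot"):
        print(agent, is_allowed(SAMPLE_ROBOTS, agent, url))
```

With a robots.txt like this one, the same page is allowed for one agent
string and rejected for the other, which is why the agent string is worth
comparing between the two installations.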

If the configuration looks OK and the DEBUG output does not tell you enough,
you can add extra logging output to the URL filters involved.
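
Before patching the filters, it may be quicker to replay a few URLs against
the filter rules outside of Nutch. The sketch below is a rough stand-in, not
Nutch code. It assumes rules in the style of regex-urlfilter.txt: each line
is "+" or "-" followed by a regular expression, the first matching rule
decides, and a URL that matches no rule is rejected. It uses Python's re
module, whose syntax differs from Java's in places. The sample rules and
URLs are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# (accept?, compiled pattern, original rule text)
Rule = Tuple[bool, re.Pattern, str]


def load_rules(text: str) -> List[Rule]:
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules: List[Rule] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern), line))
    return rules


def explain(url: str, rules: List[Rule]) -> Tuple[bool, Optional[str]]:
    """Return (accepted, deciding rule); the rule is None if nothing matched."""
    for accept, pattern, source in rules:
        # Assumption: a rule applies if the pattern is found anywhere in the URL.
        if pattern.search(url):
            return accept, source
    return False, None


# Hypothetical rules, for illustration only.
SAMPLE_RULES = r"""
# skip some image suffixes
-\.(gif|jpg|png)$
# skip URLs containing these characters
-[?*!@=]
# accept anything on one site
+^http://([a-z0-9]*\.)*example\.com/
"""

if __name__ == "__main__":
    rules = load_rules(SAMPLE_RULES)
    for url in ("http://www.example.com/a.html",
                "http://www.example.com/logo.png",
                "http://other.org/"):
        print(url, explain(url, rules))
```

Printing the deciding rule next to each URL is the same information the
extra output in the real filters would give you.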

Regards,

Arkadi
