Hi

I think I've solved the problem. When I turned up the logging I found that the Generator's FetchSchedule was rejecting all of the candidate URLs because they had a fetch time in the future. This was because the clocks on the slaves were all slightly ahead of the master's clock. So the moral of the story is: make sure you synchronise the clocks on your cluster, otherwise Nutch may fail.
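For anyone who hits the same symptom: the generate step compares each CrawlDatum's stored fetch time against the current time on the machine running the job, and quietly drops anything scheduled in the future. Roughly like this (a paraphrase of the selection logic, not the actual Nutch source; the class and method names below are illustrative):

  import org.apache.nutch.crawl.CrawlDatum;

  // Hedged sketch of the generator's due-time check.
  public class FetchTimeCheck {
    // Taken on the machine running the generate job (the master).
    private final long curTime = System.currentTimeMillis();

    // The fetch time was written by whichever slave last updated the
    // crawldb. If that slave's clock ran ahead of the master's, the URL
    // looks "not due yet" and is silently left out of the fetch list.
    boolean isDue(CrawlDatum datum) {
      return datum.getFetchTime() <= curTime;
    }
  }

Running ntpd (or a periodic ntpdate) on every node keeps the skew small enough that this never triggers.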
regards
Barry

On Thursday 14 February 2008 16:31:14 Barry Haddow wrote:
> Hi
>
> I'm trying to get a Nutch crawl to work, and it keeps stopping at depth 1
> even though there should be more data to fetch. I can download a list of
> URLs without any problem using FreeGenerator, but the recursive crawl is
> not working for me.
>
> I have crawl-urlfilter.txt set up to accept any URL, and the plugins
> configured to use this filter:
>
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-(crawl|regex)|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed</value>
>
> The only other Nutch configs that I've changed are the robots settings.
>
> If I inspect the crawldb after a run I see that it has fetched the 3 seed
> pages and refused to fetch anything else:
>
> TOTAL urls: 248
> retry 0: 248
> min score: 0.0090
> avg score: 0.03530645
> max score: 2.029
> status 1 (db_unfetched): 245
> status 2 (db_fetched): 3
>
> How can I get Nutch to fetch the rest of the URLs?
>
> Thanks in advance for your help,
>
> Barry
>
> PS: here's my crawl-urlfilter.txt:
>
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*apache.org/
>
> # skip everything else
> #-.
> +.*
