Re: Why nutch ignore some URLs?

Andy Liu Sat, 09 Apr 2005 15:51:00 -0700

By default, Nutch only follows the first 100 links on any given page.
You can change this value in nutch-site.xml.
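
A rough sketch of what that override might look like. I believe the
property that controls this limit is db.max.outlinks.per.page (default
100), but please check the name against the nutch-default.xml shipped
with your version. The value 1000 is an arbitrary example, and the
element goes inside the existing top-level element of nutch-site.xml.

```xml
<!-- Assumption: property name as found in nutch-default.xml; verify for your release. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <!-- Arbitrary example value; pick something above the page's link count. -->
  <value>1000</value>
</property>
```

If the grads index page lists more than 100 people, links past the
100th would be dropped no matter what the URL filter allows, which
would fit Yan.Huang being skipped while earlier entries are crawled.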


On Apr 8, 2005 11:23 PM, Eric Money <[EMAIL PROTECTED]> wrote:
> When I crawl http://www.cc.gatech.edu/grads/, Nutch adds most
> pages, such as http://www.cc.gatech.edu/grads/d/don and
> http://www.cc.gatech.edu/grads/k/David.Krum,
> but it ignores some that follow the same pattern. For example,
> it ignores http://www.cc.gatech.edu/grads/h/Yan.Huang
> 
> I cannot figure out why. If you want to reproduce it, here is my setup:
> 1. in urls: http://www.cc.gatech.edu/grads/
> 2. in crawl-urlfilter.txt:
> +^http://www.cc.gatech.edu/grads/
> +^http://www.cc.gatech.edu/grads/([a-z0-9]*\.//)*
> 
> and I crawl to depth 3. I hope somebody can explain what happened. Thanks, all.
>
