[Nutch-general] Re: Why nutch ignore some URLs?

Eric Money Mon, 11 Apr 2005 04:43:28 -0700

Thanks Andy, that really helps a lot!

On Apr 9, 2005 6:50 PM, Andy Liu <[EMAIL PROTECTED]> wrote:
> By default, Nutch only follows the first 100 links on any given page.
> You can change this value in nutch-site.xml.
> 
> On Apr 8, 2005 11:23 PM, Eric Money <[EMAIL PROTECTED]> wrote:
> > For example, when I crawl http://www.cc.gatech.edu/grads/, it adds most
> > pages, such as http://www.cc.gatech.edu/grads/d/don and
> > http://www.cc.gatech.edu/grads/k/David.Krum,
> > but it ignores some that follow the same pattern as those above. For example,
> > my Nutch ignores http://www.cc.gatech.edu/grads/h/Yan.Huang
> >
> > I cannot figure out why. Maybe you can try to reproduce it; here is my setup:
> > 1. in urls: http://www.cc.gatech.edu/grads/
> > 2. in crawl-urlfilter.txt:
> > +^http://www.cc.gatech.edu/grads/
> > +^http://www.cc.gatech.edu/grads/([a-z0-9]*\.//)*
> >
> > and I crawl to depth 3. I hope somebody can explain what happened. Thanks,
> > all.
> >
>
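
To make the diagnosis above concrete, here is a minimal Python sketch (not Nutch code) of the two effects discussed in this thread. It assumes the filter semantics described in the comments of crawl-urlfilter.txt, where the first matching rule decides and a URL that matches no rule is dropped, and it models the outlink limit as a simple truncation to the first 100 links. The helper names `accepts` and `keep_first_outlinks` are hypothetical, and the limit is most likely the `db.max.outlinks.per.page` property, which should be checked against your own nutch-default.xml.

```python
import re

# Rules copied from the crawl-urlfilter.txt quoted above, as (sign, pattern).
RULES = [
    ("+", r"^http://www.cc.gatech.edu/grads/"),
    ("+", r"^http://www.cc.gatech.edu/grads/([a-z0-9]*\.//)*"),
]


def accepts(url, rules=RULES):
    """Assumed semantics: the first rule that matches decides the outcome,
    and a URL that matches no rule is ignored."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


def keep_first_outlinks(links, limit=100):
    """Hypothetical model of the per-page outlink cap: keep only the
    first `limit` links, in page order."""
    return links[:limit]


if __name__ == "__main__":
    for url in (
        "http://www.cc.gatech.edu/grads/d/don",
        "http://www.cc.gatech.edu/grads/k/David.Krum",
        "http://www.cc.gatech.edu/grads/h/Yan.Huang",
    ):
        print(url, accepts(url))

    # A page with 250 links: anything past position 100 is never followed.
    page_links = ["link-%d" % i for i in range(250)]
    kept = keep_first_outlinks(page_links)
    print(len(kept), "link-150" in kept)
```

Under these assumptions all three URLs pass the filter, because the first rule already matches every URL under /grads/ and the second rule is never reached. That points away from the filter and toward the outlink cap: a link that appears late on a long index page is cut off regardless of its pattern.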


