Thanks Andy, that really helps a lot!

On Apr 9, 2005 6:50 PM, Andy Liu <[EMAIL PROTECTED]> wrote:
> By default, Nutch only follows the first 100 links on any given page.
> You can change this value in nutch-site.xml.
>
> On Apr 8, 2005 11:23 PM, Eric Money <[EMAIL PROTECTED]> wrote:
> > For example, when I was crawling http://www.cc.gatech.edu/grads/, it added
> > most pages, like http://www.cc.gatech.edu/grads/d/don and
> > http://www.cc.gatech.edu/grads/k/David.Krum,
> > but it ignored some that have the same URL pattern as the ones above. For
> > example, my Nutch ignores http://www.cc.gatech.edu/grads/h/Yan.Huang
> >
> > I cannot figure out why. Maybe you guys can try it; here is my approach:
> > 1. in urls: http://www.cc.gatech.edu/grads/
> > 2. in crawl-urlfilter.txt:
> > +^http://www.cc.gatech.edu/grads/
> > +^http://www.cc.gatech.edu/grads/([a-z0-9]*\.//)*
> >
> > and crawl to depth 3. I hope somebody can explain what happened. Thanks,
> > all.
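A quick way to confirm that the URL filter is not what drops the missing page is to replay the two rules quoted above against the URLs in question. This is only a rough sketch in Python, not Nutch code: it assumes the filter tries the rules top to bottom, lets the first pattern found in the URL decide, and rejects a URL that no rule matches. The `accepts` helper is a hypothetical name used for illustration.

```python
import re

# Rules copied from the crawl-urlfilter.txt quoted above ('+' means accept).
RULES = [
    ("+", r"^http://www.cc.gatech.edu/grads/"),
    ("+", r"^http://www.cc.gatech.edu/grads/([a-z0-9]*\.//)*"),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumption: rules are tried in order, the first match decides, and a URL
    that matches no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in (
        "http://www.cc.gatech.edu/grads/d/don",
        "http://www.cc.gatech.edu/grads/k/David.Krum",
        "http://www.cc.gatech.edu/grads/h/Yan.Huang",
    ):
        print(url, accepts(url))
```

Under those assumptions the first rule already accepts every URL under /grads/, including the Yan.Huang page, so the second rule never comes into play. That is consistent with Andy's explanation that the page was skipped because of the per-page link limit rather than the filter.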
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
