Thank you, Vishal! I modified the db.max.outlinks.per.page property as you said. I think that was the problem, and it works now! Thanks a lot!
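
For anyone who finds this thread later, here is a rough sketch of what such an override might look like in conf/nutch-site.xml. The value -1 is only an assumption for illustration, since the exact value used is not stated here. As far as I know, the default in nutch-default.xml is 100, and a negative value means all outlinks on a page are processed.

```xml
<!-- conf/nutch-site.xml: entries here override nutch-default.xml -->
<configuration>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- Assumed value for illustration: a negative number removes the per-page limit -->
    <value>-1</value>
  </property>
</configuration>
```

A larger positive limit should also work if the list page has a known upper bound on links.
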
Regards,
Keven

Vishal Shah-3 wrote:
>
> Hi Kevin,
>
> Are you using the crawl tool for crawling? If not, then you should look at
> the regex patterns in regex-urlfilter.txt.
>
> Also, look at the db.max.outlinks.per.page property in nutch-site.xml.
> This property determines how many outlinks from a page are processed.
>
> Regards,
>
> -vishal.
>
> -----Original Message-----
> From: kevin.Y [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 24, 2007 4:22 PM
> To: [email protected]
> Subject: why did nutch miss so many links when crawling?
>
>
> Hi!
> I ran into a problem when using nutch-0.9 to crawl a site.
> There was an article list page, and on this page there were many links
> pointing to the article pages.
> So I made nutch start crawling from the list page so that those articles
> could be indexed.
> However, during the crawl I found that nutch ignored all of those article
> links!
> In the end, none of those articles were indexed, only some other pages. I
> tried several times and got the same result.
> I'm sure there's no problem with conf/crawl-urlfilter.txt
> ( +^http://([a-z0-9]*\.)*site.com/ ).
> Doesn't nutch pull out all the links from a page and crawl them all? Have
> I made some stupid mistake?
> Any help?
>
> Any reply will be greatly appreciated!
> --
> View this message in context:
> http://www.nabble.com/why-did-nutch-miss-so-many-links-when-crawling--tf4322916.html#a12310200
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>

--
View this message in context: http://www.nabble.com/why-did-nutch-miss-so-many-links-when-crawling--tf4322916.html#a12322016
Sent from the Nutch - User mailing list archive at Nabble.com.
