Hi Kevin,

Are you using the crawl tool for crawling? If not, you should look at the regex patterns in conf/regex-urlfilter.txt.
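For example, the filter file Nutch ships with contains a rule that drops any URL containing characters such as '?' or '=', which is a common reason dynamic article links disappear from a crawl. A sketch of the relevant entries (the /articles/ pattern is illustrative, not your actual config; order matters, since the first matching rule wins):

```
# Sketch of conf/regex-urlfilter.txt entries (first match wins).

# The stock filter ships with a rule like this, which skips URLs
# containing characters treated as probable queries ('?', '=', etc.):
-[?*!@=]

# To crawl dynamic pages instead, comment out the rule above, or allow
# your article pattern explicitly *before* it (hypothetical path):
+^http://([a-z0-9]*\.)*site.com/articles/

# accept everything else on the site
+^http://([a-z0-9]*\.)*site.com/
```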
Also, look at the db.max.outlinks.per.page property in nutch-site.xml. This property determines how many outlinks from a page are processed.

Regards,
-vishal.

-----Original Message-----
From: kevin.Y [mailto:[EMAIL PROTECTED]
Sent: Friday, August 24, 2007 4:22 PM
To: [email protected]
Subject: why did nutch miss so many links when crawling?

hi! I ran into a problem using nutch-0.9 to crawl a site. The site has an article list page containing many links that point to the article pages, so I started the crawl at the list page so that those articles would be indexed. However, during the crawl nutch ignored all of those article links! In the end, none of the articles were indexed, only some other pages. I tried several times and got the same result. I'm sure there's no problem with conf/crawl-urlfilter.txt ( +^http://([a-z0-9]*\.)*site.com/ ). Doesn't nutch pull out all the links from a page and crawl them all? Have I made some stupid mistake? Any reply will be greatly appreciated!

--
View this message in context: http://www.nabble.com/why-did-nutch-miss-so-many-links-when-crawling--tf4322916.html#a12310200
Sent from the Nutch - User mailing list archive at Nabble.com.
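For reference, the outlink cap mentioned in the reply above can be overridden in conf/nutch-site.xml. A sketch (if I recall correctly, nutch-default.xml sets 100 in this era of Nutch, and a negative value means all outlinks are processed):

```xml
<!-- conf/nutch-site.xml: values here override nutch-default.xml -->
<property>
  <name>db.max.outlinks.per.page</name>
  <!-- a non-negative value is a hard cap per page; -1 processes all outlinks -->
  <value>-1</value>
</property>
```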
