RE: why did nutch miss so many links when crawling?

kevin.Y Fri, 24 Aug 2007 17:53:50 -0700

Thank you, Vishal!
I modified the db.max.outlinks.per.page property as you said. I think that
was the problem, and it works now! Thanks a lot!
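
For anyone who hits the same issue, the override in conf/nutch-site.xml looks
roughly like the sketch below. The value shown is an assumption for
illustration, since this thread does not record the value actually used. As far
as I understand the property description in nutch-default.xml, a negative value
means all outlinks on a page are processed, and the shipped default is 100.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the default from nutch-default.xml.
       -1 is an assumed value: it is understood to lift the per-page
       outlink limit. Use a positive number to cap outlinks instead. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>
```

A list page with more links than the limit would otherwise have the extra links
dropped, which would match the behaviour described in the original question.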


Regards,
Keven


Vishal Shah-3 wrote:
> 
> Hi Kevin,
> 
>    Are you using the crawl tool for crawling? If not, then you should look
> at the regex patterns in regex-urlfilter.txt.
> 
>    Also, look at the db.max.outlinks.per.page property in nutch-site.xml.
> This property determines how many outlinks from a page are processed.
> 
> Regards,
> 
> -vishal.
> 
> -----Original Message-----
> From: kevin.Y [mailto:[EMAIL PROTECTED] 
> Sent: Friday, August 24, 2007 4:22 PM
> To: [email protected]
> Subject: why did nutch miss so many links when crawling?
> 
> 
> Hi!
> I ran into a problem when using nutch-0.9 to crawl a site.
> There was an article list page, and on this page there were many links
> that point to the article pages.
> So I had nutch start crawling from the list page so that those articles
> could be indexed.
> However, during the crawl I found that nutch ignored all of those article
> links!
> In the end, none of those articles were indexed, only some other pages. I
> tried several times and got the same result.
> I'm sure there's no problem with conf/crawl-urlfilter.txt
> ( +^http://([a-z0-9]*\.)*site.com/ ).
> Doesn't nutch extract all the links from a page and crawl them all? Have I
> made some stupid mistake?
> Any help?
> 
> Any reply will be greatly appreciated!
> -- 
> View this message in context:
> http://www.nabble.com/why-did-nutch-miss-so-many-links-when-crawling--tf4322916.html#a12310200
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/why-did-nutch-miss-so-many-links-when-crawling--tf4322916.html#a12322016
Sent from the Nutch - User mailing list archive at Nabble.com.
