I'm also getting very slow crawl rates of around 6 pages/sec. I haven't been able to analyze this issue at length yet, for instance by using ntop to see if my network connection is pegged. However, I did get a slightly better result (about 20%-30% better) by following Sami Siren's suggestion: http://www.mail-archive.com/[email protected]/msg06533.html
I've also tried fetch2, which I think is faster, though again I'm not seeing a radical improvement.

--Kai

----- Original Message ----
From: Audrey Liu <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, July 20, 2007 1:56:52 PM
Subject: tweaking config files for better performance

Hi,

I am using Nutch 0.9, and I'm trying to crawl our intranet site (~60,000 pages, ~28,000 HTML pages). I've seen other posts where people mention getting their crawler to do 20 pages/sec, but the best I've seen so far is only 8 pages/sec. I've also read that the fetcher threads tend to block when they try to fetch pages from the same host. So I'm wondering what configuration I should set to get the best performance. My current configuration in nutch-site.xml is as follows:

<property>
  <name>fetcher.threads.fetch</name>
  <value>200</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>50</value>
</property>
<property>
  <name>http.max.delays</name>
  <value>1</value>
</property>

Any pointers are greatly appreciated!! Thanks in advance.

AL

--
View this message in context: http://www.nabble.com/tweaking-config-files-for-better-performance-tf4119552.html#a11715927
Sent from the Nutch - User mailing list archive at Nabble.com.
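For a sense of scale, the fetch rates quoted in this thread (6, 8, and 20 pages/sec) can be turned into rough full-crawl times for the ~60,000-page site. This is only a back-of-the-envelope sketch: it assumes a constant fetch rate and ignores parsing, indexing, and retries, and the function name crawl_minutes is made up for illustration (it is not part of Nutch).

```python
def crawl_minutes(pages: int, pages_per_sec: float) -> float:
    """Estimate minutes needed to fetch `pages` at a constant rate.

    Hypothetical helper for illustration; assumes the fetch rate never varies.
    """
    if pages_per_sec <= 0:
        raise ValueError("pages_per_sec must be positive")
    return pages / pages_per_sec / 60.0


TOTAL_PAGES = 60_000  # approximate site size quoted in the original message

# Rates mentioned in the thread: 6 (Kai), 8 (Audrey's best), 20 (other posts).
for rate in (6, 8, 20):
    minutes = crawl_minutes(TOTAL_PAGES, rate)
    print(f"{rate:>2} pages/sec -> about {minutes:.0f} min")
```

Under these assumptions, moving from 8 to 20 pages/sec would cut the fetch from roughly two hours to under one.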
