Insurance Squared Inc. wrote:
>>> As a second indicator of the scale, IIRC Doug Cutting posted a while
>>> ago that he downloaded and indexed 50 million pages in a day or two
>>> with about 10 servers.
>>> We download about 100,000 pages per hour on a dedicated 10mbs
>>> connection. Nutch will definitely fill more than a 10mbs connection,
>>> though; I scaled the system back to only use 10mbs before I went
>>> broke :).
>>
>> Could you please send config info and what hardware you use for
>> crawling? We've managed only 10,000 per hour, sometimes less, on
>> 100Mbit/s.
>
> For that I'm using a Dell 1750 with dual Xeons and 8 GB of RAM,
> though I can get the same with only a single P4 processor. You've
> likely got one of two issues. First, you don't actually have a
> 100mbs connection; there's a bottleneck somewhere. Second, watch
> the limit on the size of the files you crawl. I think we limit our
> file size to 64K. If that limit is too big, you end up spending
> all day downloading 10 MB PDFs; that'll really slow things down.

Nice server. We've added more disk capacity, but I think the CPU is the
real bottleneck: while running MapReduce the server sits at about 97% CPU.
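A quick back-of-the-envelope check of the numbers quoted above (my own arithmetic, not from the thread): 100,000 pages per hour on a saturated 10 Mb/s link implies an average fetched size of roughly 44 KB per page, comfortably under the 64 KB content limit mentioned.

```python
# Sanity-check the reported crawl rate against the link capacity.
link_mbps = 10            # dedicated connection, in megabits/second
pages_per_hour = 100_000  # reported crawl rate

# 10 Mb/s -> bytes per hour (1 byte = 8 bits; using decimal megabits)
bytes_per_hour = link_mbps / 8 * 1_000_000 * 3600

avg_page_bytes = bytes_per_hour / pages_per_hour
print(f"average page size at full link utilisation: {avg_page_bytes / 1024:.1f} KB")
# -> about 43.9 KB, consistent with a 64 KB per-file limit
```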
file.content.limit is set to 65536, and http.content.limit is the same.

Can you post your nutch-site.xml values? I'm especially curious about the
number of threads (total and per server), limits, delays, etc.

Thanks

--
Uros
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
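For readers following along: the thread never posts the actual config, but the properties being asked about are the standard fetcher-tuning knobs in Nutch's nutch-site.xml. Below is a hypothetical sketch; the property names (fetcher.threads.fetch, fetcher.threads.per.host, fetcher.server.delay, http.content.limit) are standard Nutch ones, but every value shown is an illustrative assumption, not the poster's real configuration.

```xml
<?xml version="1.0"?>
<!-- Illustrative nutch-site.xml fragment; values are assumptions only. -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
    <description>Total number of fetcher threads (assumed value).</description>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
    <description>Max concurrent fetches against a single host.</description>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
    <description>Seconds to wait between requests to the same host.</description>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>65536</value>
    <description>Truncate downloads larger than 64 KB, as discussed
    earlier in the thread.</description>
  </property>
</configuration>
```

Raising fetcher.threads.fetch increases parallelism across hosts, while fetcher.threads.per.host and fetcher.server.delay throttle politeness per site; a crawl stuck at 10,000 pages/hour is often limited by these per-host settings rather than by raw bandwidth.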
