Insurance Squared Inc. wrote:
>
>
>>
>>> As a second indicator of the scale, IIRC Doug Cutting posted a while 
>>> ago that he downloaded and indexed 50 million pages in a day or two 
>>> with about 10 servers.
>>> We download about 100,000 pages per hour on a dedicated 10 Mbit/s 
>>> connection.  Nutch will definitely fill more than a 10 Mbit/s 
>>> connection though; I scaled the system back to only use 10 Mbit/s 
>>> before I went broke :).
>>>
>> Could you please send your config info and the hardware you use for 
>> crawling? We've managed only 10,000 pages per hour, sometimes less, 
>> on a 100 Mbit/s connection.
>>
>> -- 
>
> For that I'm using a Dell 1750 with dual Xeons and 8 GB of RAM, 
> though I can get the same throughput with only a single P4 processor.  
> You've likely got one of two issues.  First, you may not actually 
> have a 100 Mbit/s connection; somewhere there's a bottleneck.  
> Second, watch the limit on the size of the files you crawl.  I think 
> we limit our file size to 64 KB.  If that limit is too big, you end 
> up spending all day downloading 10 MB PDFs, which really slows 
> things down.
>
Nice server. We've added more disk throughput, but I think the CPU is the 
real bottleneck: when running MapReduce, the server sits at about 97% CPU.
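A quick back-of-the-envelope check supports that suspicion. Taking the 64 KB content limit mentioned above as a (generous) per-page size, 10,000 pages per hour comes nowhere near saturating a 100 Mbit/s link, so the limit must be CPU, disk, DNS, or politeness delays rather than bandwidth (the figures below are illustrative upper bounds, not measurements):

```python
# Back-of-the-envelope bandwidth check. Pages are assumed to be at
# most 65536 bytes each (the 64 KB content limit discussed above);
# real average page size will be smaller.
def link_utilisation_mbit(pages_per_hour, bytes_per_page=65536):
    """Return the sustained link usage in Mbit/s."""
    return pages_per_hour * bytes_per_page * 8 / 3600 / 1_000_000

# 10,000 pages/hour at the 64 KB cap is under 1.5 Mbit/s, so a
# 100 Mbit/s link is mostly idle; 100,000 pages/hour would need
# roughly 14.6 Mbit/s, consistent with Nutch filling a 10 Mbit/s pipe.
print(round(link_utilisation_mbit(10_000), 1))
print(round(link_utilisation_mbit(100_000), 1))
```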

file.content.limit is set to 65536, and http.content.limit is the same. Can 
you post your nutch-site.xml values? I'm especially curious about the number 
of threads (total and per server), limits, delays, etc.
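For reference, the properties being asked about live in nutch-site.xml. A minimal sketch follows: the property names (fetcher.threads.fetch, fetcher.threads.per.host, fetcher.server.delay, http.content.limit) are standard Nutch settings, but the values shown are placeholders, not the poster's actual configuration:

```xml
<configuration>
  <!-- total fetcher threads across the crawl -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>
  <!-- concurrent threads allowed per host (politeness) -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>
  <!-- seconds to wait between requests to the same host -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>
  <!-- truncate downloaded content at 64 KB, as discussed above -->
  <property>
    <name>http.content.limit</name>
    <value>65536</value>
  </property>
</configuration>
```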

Thanks

--
Uros


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
