Would anyone care to comment on the speed of this, please? It seems
awfully slow to me.
With 20 threads, a crawl took 25 hours for about 400K URLs. It has now
been updating for 20 hours and is not yet complete.
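For context, those figures work out to only a few pages per second. A quick sketch of the arithmetic (numbers taken straight from the post, nothing Nutch-specific):

```python
# Back-of-the-envelope throughput for the crawl described above.
urls = 400_000
hours = 25
threads = 20

urls_per_sec = urls / (hours * 3600)   # overall fetch rate
per_thread = urls_per_sec / threads    # rate per fetcher thread

print(f"{urls_per_sec:.2f} URLs/sec overall, {per_thread:.3f} per thread")
# prints "4.44 URLs/sec overall, 0.222 per thread"
```

Roughly 4.4 URLs/sec overall, or about 0.22 per thread.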
System:
- nutch 0.7
- P4 2.8, 1 gig of ram
- No problems on the internet connection (I had to throttle back the
number of open threads).
- We do have a pretty heavy whitelist in the regular expression filter
for domains.
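For reference, such a whitelist lives in conf/regex-urlfilter.txt and looks roughly like the sketch below (the domains here are made up; the actual list is much longer). Each URL is checked against the patterns in order until one matches, so a very long rule list can add per-URL overhead:

```
# conf/regex-urlfilter.txt (illustrative sketch; domains are hypothetical)
# Rules apply in order; the first matching +/- pattern wins.

# skip common binary file types
-\.(gif|jpg|png|pdf|zip|gz)$

# accept only whitelisted domains
+^http://([a-z0-9-]+\.)*example-a\.com/
+^http://([a-z0-9-]+\.)*example-b\.org/

# reject everything else
-.
```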
Two days to crawl and index 400K pages is too long. Is the answer as
simple as getting bigger hardware and paying for a bigger pipe?
Thanks.
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general