Insurance Squared Inc. wrote:
> My ISP called and said my Nutch crawler is chewing up 20 Mbit/s on a
> line where it's only supposed to be using 10. Is there an easy way to
> tinker with how much bandwidth we're using at once? I know we can
> change the number of open threads the crawler has, but it seems to me
> this won't make a huge difference. If I chop the number of open
> threads in half, won't it just download half the pages, twice as
> fast? I stand to be corrected on this.
>
> Any other thoughts? It doesn't have to be correct or elegant as long
> as it works.
>
> Failing a reasonable solution in Nutch, is there some sort of
> Linux-level tool that will easily allow me to throttle how much
> bandwidth the crawl is using at once?

I put my cluster behind a m0n0wall (http://m0n0.ch), which has a
built-in traffic shaper. This is based on FreeBSD, which I prefer over
Linux for such applications, but there are similar Linux solutions, or
commercial routers with built-in traffic shaping.
I think that you could also play some tricks with a bandwidth-limiting
proxy server, because protocol-httpclient can use a proxy.
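
For illustration only, here is a minimal Python sketch of the token-bucket
idea that traffic shapers and bandwidth-limiting proxies commonly use. It is
not code from Nutch or m0n0wall; the names TokenBucket and throttled_copy and
the 16 KB chunk size are hypothetical, chosen for this example. The clock and
sleep functions are injectable so the behaviour can be checked without
waiting in real time.

```python
import time


class TokenBucket:
    """Caps average throughput at `rate` bytes/s, with bursts up to `capacity` bytes."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def consume(self, nbytes):
        """Account for nbytes; sleep if the bucket is overdrawn. Returns seconds slept."""
        self._refill()
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        # The bucket is in debt: wait long enough for the refill to pay it back.
        delay = -self.tokens / self.rate
        self.sleep(delay)
        return delay


def throttled_copy(src, dst, bucket, chunk_size=16384):
    """Relay bytes from src to dst, charging each chunk against the bucket."""
    total = 0
    while True:
        data = src.read(chunk_size)
        if not data:
            return total
        bucket.consume(len(data))
        dst.write(data)
        total += len(data)
```

A 10 Mbit/s cap corresponds to 1,250,000 bytes per second, so a relay could
create TokenBucket(rate=1_250_000, capacity=64_000) once and share it across
all connections; the capacity value here is an arbitrary burst allowance.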
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general