Insurance Squared Inc. wrote:
My ISP called and said my Nutch crawler is chewing up 20 Mbit/s on a
line where we're only supposed to be using 10. Is there an easy way to
tinker with how much bandwidth we're using at once? I know we can
change the number of open threads the crawler has, but it seems to me
this won't make a huge difference. If I chop the number of open
threads in half, won't it just download half as many pages at once,
each twice as fast? I stand to be corrected on this.
Any other thoughts? It doesn't have to be correct or elegant as long as
it works.
Failing a reasonable solution in Nutch, is there some sort of Linux-
level tool that will easily allow me to throttle how much bandwidth
the crawl is using at once?
I put my cluster behind a m0n0wall (http://m0n0.ch), which has a
built-in traffic shaper. This is based on FreeBSD, which I prefer over
Linux for such applications, but there are similar Linux solutions, or
commercial routers with built-in traffic shaping.
I think that you could also play some tricks with a bandwidth-limiting
proxy server, because protocol-httpclient can use a proxy.
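For what it's worth, the usual idea behind a traffic shaper or a
bandwidth-limiting proxy is a token bucket: bytes may only be sent when
enough "tokens" have accumulated, and tokens refill at the target rate.
Below is a minimal Python sketch of that idea. It is only an
illustration, not code from Nutch or m0n0wall; the names TokenBucket and
delay_for are made up for this example, and the 10 Mbit/s figure is just
the limit mentioned in the question.

```python
import time


class TokenBucket:
    """Hypothetical rate limiter: at most `rate` bytes/s on average,
    with bursts of up to `capacity` bytes."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # refill speed in bytes per second
        self.capacity = float(capacity)  # maximum burst size in bytes
        self.clock = clock               # injectable so it can be tested
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def delay_for(self, nbytes):
        """Account for nbytes and return how many seconds the caller
        should wait before sending them (0.0 if they fit right now)."""
        self._refill()
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        # The bucket is overdrawn; wait until the deficit has refilled.
        return -self.tokens / self.rate


if __name__ == "__main__":
    # 10 Mbit/s is 1,250,000 bytes/s; allow bursts of up to one second's worth.
    bucket = TokenBucket(rate=1_250_000, capacity=1_250_000)
    for size in (1_250_000, 625_000):
        wait = bucket.delay_for(size)
        time.sleep(wait)  # a proxy would pause here before forwarding the data
        print(f"{size} bytes: waited {wait:.2f}s")
```

A throttling proxy would call something like delay_for() for each chunk
it forwards and sleep for the returned time, so the average rate stays
at the limit no matter how many fetcher threads are open.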
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com