Vishal Shah wrote:
> Hey Andrei,
>
> Thanks a lot for the reply. That clears up a major doubt in my mind.
> FYI, I experimented with crawling on a single machine using Hadoop DFS
> and MapReduce. The largest experiment crawled around 300K pages from a
> few thousand hosts. I could push the crawler to about 27 pages/sec
> with 2000 threads. When I increased the thread count beyond 3000, the
> jobs started failing.
Look in the logs - most likely the fetches failed because of protocol
timeouts, which would indicate that you saturated your available
bandwidth. You can calculate the maximum throughput of your line and see
whether 27 pages/s is near that limit. If it is, then increasing the
number of threads or the number of machines won't speed things up.
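
For example, a quick back-of-envelope check (the 10 Mbit/s line speed
and 20 KB average page size below are made-up figures - substitute your
own measurements):

  // Rough bandwidth ceiling check. Line speed and average page size are
  // placeholder assumptions, not measurements from this thread.
  public class BandwidthCheck {
      public static void main(String[] args) {
          double lineMbps = 10.0;   // assumed downlink capacity, Mbit/s
          double avgPageKB = 20.0;  // assumed average page size, KB
          double bytesPerSec = lineMbps * 1_000_000 / 8;
          double maxPagesPerSec = bytesPerSec / (avgPageKB * 1024);
          // With these numbers the ceiling is ~61 pages/s; compare your
          // measured 27 pages/s (plus HTTP/DNS overhead) against it.
          System.out.printf("Ceiling: %.1f pages/s%n", maxPagesPerSec);
      }
  }

If the measured rate is within a small factor of that ceiling, the
network rather than CPU or thread count is your bottleneck.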
> I am now going to conduct a larger experiment on 3-4 machines, and
> will report the performance once I am done. Since I know the optimal
> number of threads on 1 machine is 2000, should I scale the thread
> count linearly to, say, 6000 for 3 machines, or will increasing the
> number of map/reduce tasks linearly take care of the scaling by
> itself?
If you have already hit the maximum bandwidth available to you, then
adding more machines with the same number of threads per machine will
only cause more fetches to fail with timeouts - in that case you should
decrease the per-machine thread count accordingly.
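
A minimal sketch of that thread-budget reasoning (the helper names are
mine, not part of the Nutch API; the 2000-thread figure comes from your
experiment):

  // Per-machine fetcher thread budget, assuming fetching is
  // bandwidth-bound. Helper names are illustrative only.
  public class ThreadBudget {
      // All machines share one line that saturatingThreads already saturates:
      static int sharedLine(int saturatingThreads, int machines) {
          return saturatingThreads / machines; // keep the cluster total constant
      }
      // Each machine has its own line of the same capacity:
      static int independentLines(int saturatingThreads) {
          return saturatingThreads;            // scale the cluster total linearly
      }
      public static void main(String[] args) {
          System.out.println(sharedLine(2000, 3));    // ~666 threads per machine
          System.out.println(independentLines(2000)); // 2000 threads per machine
      }
  }

In other words: scale threads with the bandwidth you actually add, not
with the machine count.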
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com