Vishal Shah wrote:
> Hey Andrei,
>
>   Thanks a lot for the reply. That clears up a major doubt in my mind.
> FYI, I experimented with crawling on a single machine using Hadoop DFS
> and MapReduce. The largest experiment crawled around 300K pages from a
> few thousand hosts. I could push the crawler to a speed of around 27
> pages/sec using 2000 threads. When I increased the number of threads
> beyond 3000, the jobs started failing.

Look into the logs - most probably the fetches failed because of protocol timeouts, which would indicate that you saturated your available bandwidth. You can calculate the maximum throughput of your line and see whether 27 pages/s is near that limit. If it is, then increasing the number of threads or the number of machines won't speed things up.
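That back-of-the-envelope check can be sketched as follows. The 10 Mbit/s line and the 45 KB average page size are illustrative assumptions, not numbers from this thread - plug in your own:

```python
# Rough check: is the observed 27 pages/s near the line's capacity?
LINE_MBPS = 10      # downstream bandwidth in Mbit/s (assumed value)
AVG_PAGE_KB = 45    # average fetched page size incl. headers (assumed value)

line_bytes_per_sec = LINE_MBPS * 1_000_000 / 8
max_pages_per_sec = line_bytes_per_sec / (AVG_PAGE_KB * 1024)

print(f"theoretical ceiling: {max_pages_per_sec:.1f} pages/s")
# With these assumed numbers the ceiling comes out to about 27 pages/s;
# if your observed rate sits that close to the ceiling, adding threads
# or machines will only produce more timeouts, not more throughput.
```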

> I am now going to conduct a larger experiment on 3-4 machines and will
> report the performance once I am done. Since I know the optimal number
> of threads on 1 machine is 2000, should I scale the thread count
> linearly to, say, 6000 for 3 machines, or will linearly increasing the
> number of map/reduce tasks take care of the scaling?

If you have hit the maximum bandwidth available to you, then adding more machines with the same number of threads per machine will only cause more fetches to fail with timeouts - in that case you should decrease the per-machine thread count accordingly.
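In other words, when the shared line is the bottleneck, keep the total thread count roughly constant and divide it across machines rather than multiplying it. A minimal sketch, using the 2000-thread optimum mentioned above:

```python
# If 2000 threads already saturate the line on one machine, spread that
# same total across the cluster instead of running 2000 per machine.
TOTAL_THREADS = 2000    # single-machine optimum from the experiment above
machines = 3

threads_per_machine = TOTAL_THREADS // machines
print(f"{threads_per_machine} fetcher threads per machine")  # 666 per machine
```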

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

