Vishal Shah wrote:
> Hey Andrei,
>
> Thanks a lot for the reply. That clears up a major doubt in my mind.
> FYI, I experimented with crawling on a single machine using Hadoop DFS
> and MapReduce. The largest experiment crawled around 300K pages from a
> few thousand hosts. I could push the crawler to about 27 pages/sec
> with 2000 threads. When I increased the thread count beyond 3000, the
> jobs started failing.
Look in the logs - most likely the fetches failed because of protocol
timeouts, which would indicate that you saturated your available
bandwidth. You can calculate the maximum throughput of your line and see
whether 27 pages/s is near that limit. If it is, then increasing the
number of threads or the number of machines won't speed things up.
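
For example, a quick back-of-envelope check (the 10 Mbit/s line speed
and 20 KB average page size below are made-up figures - substitute your
own measurements):

  // Rough bandwidth ceiling check. Line speed and average page size are
  // placeholder assumptions, not measurements from this thread.
  public class BandwidthCheck {
      public static void main(String[] args) {
          double lineMbps = 10.0;   // assumed downlink capacity, Mbit/s
          double avgPageKB = 20.0;  // assumed average page size, KB
          double bytesPerSec = lineMbps * 1_000_000 / 8;
          double maxPagesPerSec = bytesPerSec / (avgPageKB * 1024);
          // With these numbers the ceiling is ~61 pages/s; compare your
          // measured 27 pages/s (plus HTTP/DNS overhead) against it.
          System.out.printf("Ceiling: %.1f pages/s%n", maxPagesPerSec);
      }
  }

If the measured rate is within a small factor of that ceiling, the
network rather than CPU or thread count is your bottleneck.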
> I am now going to conduct a larger experiment on 3-4 machines, and
> will report the performance once I am done. Since I know the optimal
> number of threads on 1 machine is 2000, should I scale the thread
> count linearly to, say, 6000 for 3 machines, or will increasing the
> number of map/reduce tasks linearly take care of the scaling by
> itself?
If you have already hit the maximum bandwidth available to you, then
adding more machines with the same number of threads per machine will
only cause more fetches to fail with timeouts - in that case you should
decrease the per-machine thread count accordingly.
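
A minimal sketch of that thread-budget reasoning (the helper names are
mine, not part of the Nutch API; the 2000-thread figure comes from your
experiment):

  // Per-machine fetcher thread budget, assuming fetching is
  // bandwidth-bound. Helper names are illustrative only.
  public class ThreadBudget {
      // All machines share one line that saturatingThreads already saturates:
      static int sharedLine(int saturatingThreads, int machines) {
          return saturatingThreads / machines; // keep the cluster total constant
      }
      // Each machine has its own line of the same capacity:
      static int independentLines(int saturatingThreads) {
          return saturatingThreads;            // scale the cluster total linearly
      }
      public static void main(String[] args) {
          System.out.println(sharedLine(2000, 3));    // ~666 threads per machine
          System.out.println(independentLines(2000)); // 2000 threads per machine
      }
  }

In other words: scale threads with the bandwidth you actually add, not
with the machine count.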
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com