Hadoop Job Performance as MapReduce count increases?

Rob Stewart Sat, 07 Nov 2009 11:02:53 -0800

Hi, briefly, I'm writing my dissertation on distributed computation, looking
in detail at the various interfaces on top of Hadoop, including Pig, Hive,
JAQL, etc.
One thing I have noticed in early testing is that Pig tends to generate more
Map tasks for a given query than the other interfaces do for an identically
designed query.


So my question to you MapReduce folks is this:
------------
 If there are 100 Map tasks spread across 10 DataNodes, and one DataNode
fails, then approximately 10 Map tasks will be redistributed over the
remaining 9 DataNodes. If, however, there were 500 Map tasks over the 10
DataNodes and one of them fails, then approximately 50 Map tasks will be
reallocated to the remaining 9 DataNodes. Should I expect a difference in
overall performance between these two scenarios?
-----------
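
To make the arithmetic in the scenario above concrete, here is a minimal
sketch in Python. It assumes a toy model that is not taken from Hadoop
itself: map tasks are spread perfectly evenly over the nodes, and every task
that was on a failed node is rerun on the survivors. The function name
`reallocation` and its parameters are hypothetical, chosen only for
illustration.

```python
from fractions import Fraction


def reallocation(total_map_tasks, nodes, failed_nodes=1):
    """Toy model (an assumption, not Hadoop's scheduler): tasks are spread
    evenly, and every task on a failed node is rerun on the survivors."""
    if total_map_tasks <= 0:
        raise ValueError("total_map_tasks must be positive")
    if not 0 <= failed_nodes < nodes:
        raise ValueError("need 0 <= failed_nodes < nodes")

    per_node = Fraction(total_map_tasks, nodes)   # tasks held by each node
    reallocated = per_node * failed_nodes         # tasks that must be rerun
    survivors = nodes - failed_nodes
    return {
        "reallocated": reallocated,
        # Share of the whole job that has to be redone.
        "fraction_of_job": reallocated / total_map_tasks,
        # Additional tasks each surviving node picks up, on average.
        "extra_per_survivor": reallocated / survivors,
    }


if __name__ == "__main__":
    for total in (100, 500):
        result = reallocation(total, nodes=10)
        print(total, {key: float(value) for key, value in result.items()})
```

Under these assumptions the share of the job that is rerun is one tenth in
both cases; what differs is the number and size of the individual tasks, so
any real performance difference would have to come from effects this model
leaves out (per-task overhead, scheduling, data locality).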

The reason for asking is that I would like to discuss in more detail
whether, in a situation where many faults occur on the cluster, a Hadoop
job with many Map/Reduce tasks will handle the unreliability better than a
Hadoop job that has far fewer Map/Reduce tasks.

If this were the case, is it fair to say that, if the reliability of a
Hadoop cluster (network reliability, DataNode reliability, etc.) were known
before a job was sent to the cluster, the user submitting the job would want
to adjust the number of Map/Reduce tasks depending on that reliability?


I may be well off course with this idea, and if this is the case, do let me
know!


thanks,


Rob Stewart
