Hi, briefly: I'm writing my dissertation on distributed computation, looking
in detail at the various interfaces on top of Hadoop, including Pig, Hive,
JAQL, etc.
One thing I have noticed in early testing is that Pig tends to generate more
Map tasks for a given query than the other interfaces do for an identical
query design.

So my question to you MapReduce folks is this:
------------
 If there are 100 Map tasks spread across 10 DataNodes and one DataNode
fails, then approximately 10 Map tasks will be redistributed over the
remaining 9 DataNodes. If, however, there were 500 Map tasks over the 10
DataNodes and one of them fails, then about 50 Map tasks will be reallocated
to the remaining 9 DataNodes. Should I expect a difference in overall
performance between these two scenarios?
-----------
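
To make the arithmetic in the question concrete, here is a rough sketch. It
assumes tasks are spread evenly across the nodes and that every task assigned
to a failed node has to be run again elsewhere; the function name and its
parameters are only illustrative and are not part of any Hadoop API.

```python
def redistributed_tasks(total_tasks, nodes, failed_nodes=1):
    """Return (tasks to re-run, extra tasks per surviving node).

    Assumption: tasks are spread evenly across nodes, and every task
    assigned to a failed node must be run again on a surviving node.
    """
    if not 0 <= failed_nodes < nodes:
        raise ValueError("need at least one surviving node")
    lost = total_tasks * failed_nodes / nodes
    per_survivor = lost / (nodes - failed_nodes)
    return lost, per_survivor


for total in (100, 500):
    lost, extra = redistributed_tasks(total, nodes=10)
    print(f"{total} tasks: {lost:.0f} to re-run, "
          f"~{extra:.2f} extra per surviving node")
```

Under these assumptions the same fraction of tasks (10%) is lost in both
scenarios; what differs is how many tasks that fraction is split into.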

The reason I ask is that I would like to discuss in more detail whether, in
a situation where many faults occur on the cluster, a Hadoop job with many
Map/Reduce tasks will handle the unreliability better than a Hadoop job with
far fewer Map/Reduce tasks.

If this is the case, is it true to say that, if the reliability of a
Hadoop cluster (network reliability, DataNode reliability, etc.) were known
before a job was sent to the cluster, the user submitting the job would want
to adjust the number of Map/Reduce tasks depending on that reliability?


I may be well off course with this idea, and if this is the case, do let me
know!


thanks,


Rob Stewart
