Hi,

Briefly: I'm writing my dissertation on distributed computation, looking in detail at the various interfaces on top of Hadoop, including Pig, Hive, JAQL, etc. One thing I have noticed in early testing is that Pig tends to generate more map tasks for a given query than the other interfaces do for an identical query design.
So my question to you MapReduce folks is this:

------------
If there are 100 map tasks spread across 10 DataNodes and one DataNode fails, then approximately 10 map tasks will be redistributed over the remaining 9 DataNodes. If, however, there were 500 map tasks over the 10 DataNodes and one of them fails, then approximately 50 map tasks will be reallocated to the remaining 9 DataNodes. Should I expect a difference in overall performance between these two scenarios?
------------

The reason I ask is that I would like to discuss in more detail whether, in a situation where many faults occur on the cluster, a Hadoop job with many map/reduce tasks will handle the unreliability better than a Hadoop job with far fewer map/reduce tasks.

If this were the case, would it be true to say that, if the reliability of a Hadoop cluster (network reliability, DataNode reliability, etc.) were known before a job was submitted, the user submitting the job would want to adjust the number of map/reduce tasks depending on that reliability?

I may be well off course with this idea, and if so, do let me know!

Thanks,
Rob Stewart
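
P.S. To make the arithmetic in my question concrete, here is a rough Python sketch. The function name and the model are my own for illustration and are not anything Hadoop provides. It assumes tasks are spread perfectly evenly over the nodes, that every task on a failed node has to be rerun, and that the reruns are shared evenly among the survivors. It ignores task startup overhead, partially completed tasks, and scheduling.

```python
def tasks_to_reassign(total_tasks, nodes, failed_nodes=1):
    """Return (tasks_lost, extra_tasks_per_survivor, fraction_of_job_lost).

    Hypothetical model: tasks are spread evenly across nodes, and all
    tasks on a failed node are rerun, shared evenly among the survivors.
    """
    if not 0 < failed_nodes < nodes:
        raise ValueError("failed_nodes must be between 1 and nodes - 1")
    survivors = nodes - failed_nodes
    tasks_lost = total_tasks * failed_nodes / nodes
    extra_per_survivor = tasks_lost / survivors
    fraction_lost = tasks_lost / total_tasks
    return tasks_lost, extra_per_survivor, fraction_lost


# The two scenarios from the question: 100 and 500 map tasks on 10 nodes.
for total in (100, 500):
    lost, extra, fraction = tasks_to_reassign(total, nodes=10)
    print(f"{total} tasks: {lost:.0f} lost, "
          f"{extra:.2f} extra per surviving node, "
          f"{fraction:.0%} of the job's tasks rerun")
```

Under these assumptions the fraction of tasks that has to be rerun is the same in both scenarios, so what I am really asking is whether the finer granularity of the 500-task case makes any practical difference.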
