That makes sense. It's worth pointing out that tasks are scheduled on a "pull" basis -- tasktrackers ask for more work if they have free slots for tasks -- so it is not a given that all nodes will receive the same number of tasks. If some tasks take considerably longer (or some nodes are faster/slower than the others), then those nodes may request fewer tasks from the jobtracker as the job runs -- so in practice you may not get results as even as your table suggests.
- Aaron On Wed, Nov 11, 2009 at 8:30 AM, Rob Stewart <[email protected]>wrote: > > Hi Aaron, > > your response was very useful indeed, thank you very much. > > OK, I've documented the scenario (relevent to my experiments), where the > cluster is very small, only 10 nodes. > > I have uploaded this section only to : > > http://linuxsoftwareblog.com/Hadoop/small_cluster_scenario.png > > Can I ask, does the paragraph, and the subsequent table make relevant sense > to you, and does it reflect the true nature of potential performance issues > with very small MapReduce task count on an unreliable small cluster? > > > thanks, > > > Rob Stewart > > > > > 2009/11/9 Aaron Kimball <[email protected]> > > >> >> On Sat, Nov 7, 2009 at 11:01 AM, Rob Stewart <[email protected] >> > wrote: >> >>> Hi, briefly, I'm writing my dissertation on Distributed computation, and >>> then in detail at the various interfaces atop of Hadoop, including Pig, >>> Hive, JAQL etc... >>> One thing I have noticed in early testing is that Pig tends to generate >>> more Map tasks for a given query, than other interfaces for identical query >>> design. >>> >>> So my question to you MapReduce folks is this: >>> ------------ >>> If there are 100 Map jobs, spread across 10 DataNodes, and one DataNode >>> fails, then approximately 10 Map jobs will be redistributed over the >>> remaining 9 DataNodes. If, however, there were 500 Map jobs over the 10 >>> DataNodes, one of them fails, then 50 Map jobs will be reallocated to the >>> remaining 9 DataNodes. Am I to expect a difference in overal performance in >>> both of these scenario's? >>> >> >> >> That depends entirely on how heavy the workloads are in these various >> jobs. By the way; the term-of-art in Hadoop is "map task" -- a single >> MapReduce "job" contains a set of map tasks and a set of reduce tasks, each >> of which may be executed multiple times (e.g., in the event of node >> failure); such re-executions of a given task are known as task attempts. >> >> At any rate, it is likely that on a small number of nodes (e.g., 10) a >> higher number of more-granular tasks will result in better overall >> performance. If the 500 tasks were doing the same work as the 10 tasks in >> the original case, then were a node to fail, 50 tasks would need to be >> redistributed. This can happen in several ways depending on which nodes have >> free resources; likely all 9 healthy nodes will share in the work. Whereas >> if there was a single task on the failed node, then only one other machine >> could pick that up. >> >> On the other hand, there is a cost associated with the setup/teardown of >> tasks as well as merging their results for reducers; breaking up work into >> more tasks is good to a point, but going from 10 to 500 is likely to slow >> down the overall result in the average no-failure case. I think most folks >> strive for average task size to be anywhere from 128 MB to 2048 MB of >> uncompressed data. A 50x task granularity improvement would push average >> task running times well below the threshold of usability. >> >> - Aaron >> >> >> >>> ----------- >>> >>> The reason for wanting to know this is to perhaps discuss in more detail >>> as to whether, in a situation where many faults on the cluster occur, an >>> Hadoop job with many Map/Reduce tasks will handle the unreliability better >>> than an Hadoop job that has much fewer Map/Reduce tasks? >>> >> >> Depends what you mean by "handle the unreliability." The MapReduce >> platform will get your job done in either case. There is no inherent >> resilience or weakness to failure based on number of tasks. As stated above, >> more granular tasks may result in more even redistribution of additional >> work. >> >> >>> >>> If this were the case, is it true to state that, if the reliability of an >>> Hadoop cluster (network reliability, DataNode reliability etc...) were known >>> before a job was sent to the cluster, the user submitting the job would want >>> to adjust the number of Map/Reduce tasks dependant on the reliability? >>> >>> >> Not likely. Other parameters, such as the acceptable number of attempt >> failures per task, various timeout values, etc, are considerably better >> tuning parameters given a baseline reliability profile for a cluster. >> >> Any given node in a cluster is likely to be less reliable as the size of >> the cluster grows. e.g., if you have 200 nodes, that's likely to result in >> more frequent node failures than a cluster of 10 nodes. And job size will >> grow with cluster size (you don't buy 200 nodes unless you have a lot of >> work to do). So a job running "at scale" of 50, 100, or 200 nodes will often >> see anywhere from 500 to 10,000 map tasks anyway. This already represents >> plenty of opportunity for granular work reassignment; switching a given job >> from 3,000 to 6,000 map tasks is unlikely to improve the overall running >> time very much. Since individual tasks in any multi-thousand-task job will >> likely have a high degree of variance in runtime, the efficiency of the >> jobtracker of scheduling work won't necessarily be improved by having that >> many more tasks anyway. >> >> - Aaron >> >> >>> >>> I may be well off course with this idea, and if this is the case, do let >>> me know! >>> >>> >>> thanks, >>> >>> >>> Rob Stewart >>> >> >> >
