An alternative is to run two TaskTracker clusters whose nodes live on the same machines. One cluster is for IO-intensive jobs and has a low number of map/reduce slots per tracker; the other is for CPU-intensive jobs and has a high number of map/reduce slots per tracker.
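For reference, the per-tracker slot caps that would differ between the two clusters are ordinary TaskTracker settings in mapred-site.xml. A minimal sketch, assuming the standard 0.19/0.20-era property names and with purely illustrative slot counts, for the trackers in the IO-intensive cluster (the CPU-intensive cluster would ship a higher value):

<!-- mapred-site.xml on TaskTrackers in the IO-intensive cluster;
     the CPU-intensive cluster would use a higher value.
     The slot counts shown are illustrative only. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>

These are read by each TaskTracker at startup, which is why the two-cluster trick is needed to get different limits on the same machines.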
A simpler alternative is to use a multi-threaded mapper for the CPU-intensive jobs and tune the thread count on a per-job basis (a rough sketch follows the quoted thread below). In the longer term, being able to alter the TaskTracker control parameters at run time on a per-job basis would be wonderful.

On Tue, Feb 3, 2009 at 12:01 PM, Nathan Marz <nat...@rapleaf.com> wrote:
> This is a great idea. For me, this is related to:
> https://issues.apache.org/jira/browse/HADOOP-5160. Being able to set the
> number of tasks per machine on a job by job basis would allow me to solve
> my problem in a different way. Looking at the Hadoop source, it's also
> probably simpler than changing how Hadoop schedules tasks.
>
> On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote:
>
>> Chris,
>>
>> For my specific use cases, it would be best to be able to set N
>> mappers/reducers per job per node (so I can explicitly say, run at most
>> 2 at a time of this CPU bound task on any given node). However, the
>> other way would work as well (on a 10 node system, I would set the job
>> to max 20 tasks at a time globally), but it opens up the possibility
>> that a node could be assigned more than 2 of that task.
>>
>> I would work with whatever is easiest to implement, as either would be
>> a vast improvement for me (I could run high numbers of network latency
>> bound tasks without fear of CPU bound tasks killing the cluster).
>>
>> JG
>>
>>> -----Original Message-----
>>> From: Chris K Wensel [mailto:ch...@wensel.net]
>>> Sent: Tuesday, February 03, 2009 11:34 AM
>>> To: core-user@hadoop.apache.org
>>> Subject: Re: Control over max map/reduce tasks per job
>>>
>>> Hey Jonathan
>>>
>>> Are you looking to limit the total number of concurrent mappers/
>>> reducers a single job can consume cluster wide, or limit the number
>>> per node?
>>>
>>> That is, you have X mappers/reducers, but can only allow N mappers/
>>> reducers to run at a time globally, for a given job.
>>>
>>> Or, you are cool with all X running concurrently globally, but want to
>>> guarantee that no node can run more than N tasks from that job?
>>>
>>> Or both?
>>>
>>> Just reconciling the conversation we had last week with this thread.
>>>
>>> ckw
>>>
>>> On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote:
>>>
>>>> All,
>>>>
>>>> I have a few relatively small clusters (5-20 nodes) and am having
>>>> trouble keeping them loaded with my MR jobs.
>>>>
>>>> The primary issue is that I have different jobs with drastically
>>>> different patterns. I have jobs that read/write to/from HBase or
>>>> Hadoop with minimal logic (network throughput bound or IO bound),
>>>> others that perform crawling (network latency bound), and one huge
>>>> parsing streaming job (very CPU bound, each task eats a core).
>>>>
>>>> I'd like to launch very large numbers of tasks for the network
>>>> latency bound jobs, however the large CPU bound job means I have to
>>>> keep the max maps allowed per node low enough so as not to starve
>>>> the Datanode and Regionserver.
>>>>
>>>> I'm an HBase dev but not familiar enough with the Hadoop MR code to
>>>> even know what would be involved in implementing this. However, in
>>>> talking with other users, it seems like this would be a well-received
>>>> option.
>>>>
>>>> I wanted to ping the list before filing an issue because it seems
>>>> like someone may have thought about this in the past.
>>>>
>>>> Thanks.
>>>>
>>>> Jonathan Gray
>>>>
>>> --
>>> Chris K Wensel
>>> ch...@wensel.net
>>> http://www.cascading.org/
>>> http://www.scaleunlimited.com/
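To make the multi-threaded mapper idea concrete, here is a minimal sketch using the stock MultithreadedMapper from the newer org.apache.hadoop.mapreduce API. The class names, input/output types, paths taken from args, and the thread count of 4 are all placeholder assumptions; only MultithreadedMapper and its static setters come from the Hadoop library itself.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultithreadedParseJob {

  // Hypothetical CPU-heavy mapper; the body is only a placeholder.
  // It must be thread-safe, since MultithreadedMapper runs several
  // calls to map() concurrently inside one task.
  public static class MyParsingMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // ... expensive parsing work would go here ...
      context.write(value, new LongWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "cpu-bound parse");
    job.setJarByClass(MultithreadedParseJob.class);

    // Wrap the real mapper in MultithreadedMapper so a single map slot
    // can keep several cores busy.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, MyParsingMapper.class);

    // Tune this per job; 4 threads is an arbitrary example.
    MultithreadedMapper.setNumberOfThreads(job, 4);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If your jobs are still on the old mapred API, the equivalent wrapper there is MultithreadedMapRunner.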