There is considerable and very understandable confusion about map tasks, mappers, and input splits.
It is true that for large inputs the input should ultimately be split into chunks so that each core you have gets 10-100 pieces of data to process. To do that, however, you only need one map task per core, plus perhaps an extra one or two to fill in any scheduling cracks. The input splits are good enough at dividing up the input data that you usually don't have to worry about this unless your data is for some reason unsplittable. In that case, you have to make sure you have enough input files to provide sufficient parallelism.

Then you just set the number of map tasks per machine at a good level (I would recommend about 10 for you). You should set that limit on the number of map tasks per machine in hadoop-site.xml (see the sketch at the end of this message).

When you drill into the data on the web control panel for the map-reduce system, you will eventually find data for individual map "tips". These are units of work, and when they are assigned to tasks on nodes you will be able to see that. Sometimes a tip will be assigned to more than one node if the job tracker thinks that might be a good idea for fault tolerance. Whenever a tip is completed, the other tasks working on that same tip will be killed.

Does that help a bit?

On 1/16/08 2:50 AM, "Johan Oskarsson" <[EMAIL PROTECTED]> wrote:

> I simply followed the wiki "The right level of parallelism for maps
> seems to be around 10-100 maps/node",
> http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces
>
> We have 8 cores in each machine, so perhaps 100 mappers ought to be
> right. It's set to 157 in the config, but Hadoop used ~200 for the job,
> I don't know why. That would of course help in this case, but what about
> when we process large datasets? Especially if a mapper fails.
>
> Reducers I also set up to use ~1 per core, slightly less.
>
> /Johan
>
> Ted Dunning wrote:
>> Why so many mappers and reducers relative to the number of machines you
>> have? This just causes excess heartache when running the job.
>>
>> My standard practice is to run with a small factor larger than the number of
>> cores that I have (for instance 3 tasks on a 2 core machine). In fact, I
>> find it most helpful to have the cluster defaults rule the choice except in
>> a few cases where I want one reducer or a few more than the standard 4
>> reducers.
>>
>> On 1/15/08 9:15 AM, "Johan Oskarsson" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi.
>>>
>>> I believe someone posted about this a while back, but it's worth
>>> mentioning again.
>>>
>>> I just ran a job on our 10 node cluster where the input data was
>>> ~70 empty sequence files; with our default settings this ran about ~200
>>> mappers and ~70 reducers.
>>>
>>> The job took almost exactly two minutes to finish.
>>>
>>> How can we reduce this overhead?
>>>
>>> * Pick the number of mappers and reducers in a more dynamic way,
>>> depending on the size of the input?
>>> * JVM reuse, one JVM per job instead of one per task?
>>>
>>> Any other ideas?
>>>
>>> /Johan
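
P.S. Here is a minimal sketch of the hadoop-site.xml entries I mean, assuming a release that has separate per-tasktracker map and reduce limits (older releases use a single mapred.tasktracker.tasks.maximum instead). The values are only illustrative for an 8-core box, not something to copy verbatim:

    <!-- hadoop-site.xml on each tasktracker (illustrative values) -->
    <property>
      <!-- cap on concurrent map tasks per machine (~10 as suggested above) -->
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>10</value>
    </property>
    <property>
      <!-- cap on concurrent reduce tasks per machine (roughly one per core) -->
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>8</value>
    </property>

Note that these per-machine caps are separate from the per-job settings (mapred.map.tasks / mapred.reduce.tasks, or JobConf.setNumMapTasks() and setNumReduceTasks()). The per-job numbers control how many tasks the job is cut into (the map count being more of a hint than a guarantee), while the tasktracker limits are what keep any single box from being oversubscribed.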