You can tune the number of map tasks per node with the config variable "mapred.tasktracker.tasks.maximum" on the jobtracker (there is a patch to make it configurable on the tasktracker: see https://issues.apache.org/jira/browse/HADOOP-1245).
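For illustration only (a sketch; the value 4 is arbitrary), the setting would look something like this in the hadoop-site.xml read by the jobtracker, or by each tasktracker once the HADOOP-1245 patch is applied:

  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>4</value>
  </property>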
-Michael

On 10/22/07 5:53 PM, "Lance Amundsen" <[EMAIL PROTECTED]> wrote:

OK, I spent forever playing with overriding SequenceFileInputFormat behavior and attempting my own completely different input format (extending SFIF), but I finally just decided to download the Hadoop source and see exactly what it is doing. It turns out that there is a constant SYNC_INTERVAL in SequenceFile, and that the SFIF constructor calls setMinSplitSize with this value (2000). So getting a split size of less than 2000 was impossible. I just hard-coded a split size equal to my record size in FileInputFormat, and now I am getting exactly what I want: one map invocation per record from "one" input file.

Next I want to increase the number of tasks being executed concurrently on each node... currently 2 or 3 seems to be the upper limit (at least on the earlier binaries I was running). Any comments appreciated... searching the code now.

Lance
IBM Software Group - Strategy
Performance Architect
High-Performance On Demand Solutions (HiPODS)
650-678-8425 cell

On 10/18/2007 09:44 PM, "Owen O'Malley" <[EMAIL PROTECTED]> wrote
To: hadoop-user@lucene.apache.org
Subject: Re: InputFiles, Splits, Maps, Tasks Questions

On Oct 18, 2007, at 5:04 PM, Lance Amundsen wrote:

> You said arbitrary... maybe I missed something. Can I construct a
> getSplits() method that chunks up the file however I want?

Yes. The application specifies an InputFormat class, which has a getSplits method that returns a list of InputSplits. The "standard" input formats extend FileInputFormat, which has the behavior we have been describing. However, your InputFormat can generate InputSplits however it wants. For an example of an unusual variation, look at the RandomWriter example. It creates input splits that aren't based on any files at all; it just creates a split for each map that it wants.

> I assumed I needed to return a split map that corresponded to key,
> value boundaries.

SequenceFileInputFormat and TextInputFormat don't need the splits to match the record boundaries. They both start at the first record after the split's start offset and continue to the next record after the split's end. TextInputFormat always uses "\n" as the record delimiter, and SequenceFile uses constant blocks of bytes ("sync markers") to find record boundaries.

> 1 file, 1,000 records, 1,000 maps requested yields 43 actual maps
> 1 file, 10,000 records, 10,000 maps requested yields 430 actual maps

I don't understand how this is happening. What are the data size, block size, and minimum split size in your job?

> In all of these cases I can only get 2 tasks/node running at the same
> time... once in a while 3 run... even though I have specified a higher
> number to be allowed.

Are your maps finishing quickly (< 20 seconds)?

> I want 1 map per record, from one file, for any number of records, and
> I want it guaranteed. Later I may want 10 records, or 100, but right
> now I want to force a one-record-per-mapper relationship, and I do not
> want to pay the file creation overhead of, say, 1000 files just to get
> 1000 maps.

That is completely doable, although to make it perform well you either need an index from row number to file offset or fixed-width records... In any case, you'll need to write your own InputFormat.

-- Owen
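To make Owen's suggestion concrete, here is a minimal sketch (not code from this thread) of an InputFormat along the lines he describes: one InputSplit per fixed-width record, with a RecordReader that returns exactly one record per split. It assumes the old org.apache.hadoop.mapred API being discussed here; the class name FixedWidthRecordInputFormat and the record width RECORD_BYTES are made up for the example, and a real job would take the width from the configuration or from an index of record offsets.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;

  public class FixedWidthRecordInputFormat
      extends FileInputFormat<LongWritable, BytesWritable> {

    // Hypothetical fixed record width in bytes (example value only).
    private static final int RECORD_BYTES = 128;

    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
      List<InputSplit> splits = new ArrayList<InputSplit>();
      for (Path path : FileInputFormat.getInputPaths(job)) {
        FileSystem fs = path.getFileSystem(job);
        long len = fs.getFileStatus(path).getLen();
        // One split per record, ignoring block size and minimum split size.
        for (long offset = 0; offset < len; offset += RECORD_BYTES) {
          long splitLen = Math.min(RECORD_BYTES, len - offset);
          splits.add(new FileSplit(path, offset, splitLen, job));
        }
      }
      return splits.toArray(new InputSplit[splits.size()]);
    }

    @Override
    public RecordReader<LongWritable, BytesWritable> getRecordReader(
        InputSplit split, JobConf job, Reporter reporter) throws IOException {
      final FileSplit fileSplit = (FileSplit) split;
      final FSDataInputStream in =
          fileSplit.getPath().getFileSystem(job).open(fileSplit.getPath());

      return new RecordReader<LongWritable, BytesWritable>() {
        private boolean done = false;

        public boolean next(LongWritable key, BytesWritable value) throws IOException {
          if (done) {
            return false;                         // exactly one record per split
          }
          byte[] buf = new byte[(int) fileSplit.getLength()];
          in.readFully(fileSplit.getStart(), buf); // positioned read of this record
          key.set(fileSplit.getStart() / RECORD_BYTES); // record number as the key
          value.set(buf, 0, buf.length);
          done = true;
          return true;
        }

        public LongWritable createKey() { return new LongWritable(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return done ? fileSplit.getLength() : 0; }
        public float getProgress() { return done ? 1.0f : 0.0f; }
        public void close() throws IOException { in.close(); }
      };
    }
  }

This gets one map per record from a single file without fighting SequenceFileInputFormat's minimum split size, but every record still pays the full task startup cost, which is why Owen asks whether the maps are finishing in under 20 seconds.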