On Oct 18, 2007, at 5:04 PM, Lance Amundsen wrote:

You said arbitrary.. maybe I missed something.  Can I construct a
getSplits() method that chunks up the file however I want?

Yes. The application specifies an InputFormat class, which has a getSplits method that returns a list of InputSplits. The "standard" input formats extend FileInputFormat, which has the behavior we have been describing. However, your InputFormat can generate InputSplits however it wants. For an example of an unusual variation, look at the RandomWriter example. It creates input splits that aren't based on any files at all; it just creates one split for each map that it wants.
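
To make the "one split per map, no files involved" idea concrete, here is a rough sketch in plain Java. SyntheticSplit and SyntheticSplitSource are hypothetical stand-ins invented for this illustration; they are not the Hadoop InputSplit or InputFormat types, and this is not the RandomWriter code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a split: it only needs to describe one map's work.
interface SyntheticSplit {
    String describe();
}

// Hypothetical stand-in for the split-producing half of an input format.
class SyntheticSplitSource {
    // Creates exactly one split per requested map; no file is consulted.
    List<SyntheticSplit> getSplits(int numMaps) {
        List<SyntheticSplit> splits = new ArrayList<>();
        for (int i = 0; i < numMaps; i++) {
            final int id = i;
            splits.add(() -> "synthetic-split-" + id);
        }
        return splits;
    }
}
```

The point is only that the number of splits comes from whatever logic you write, not from file or block sizes.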

  I assumed I
needed to return a split map that corresponded to key, value boundaries,

SequenceFileInputFormat and TextInputFormat don't need the splits to match the record boundaries. They both start at the first record after the split's start offset and continue until the next record boundary after the split's end. TextInputFormat always treats "\n" as the record delimiter, and SequenceFile uses constant blocks of bytes ("sync markers") to find record boundaries.
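
The boundary rule for newline-delimited text can be sketched roughly as follows. This is a simplified illustration over an in-memory byte array, not the actual TextInputFormat reader; the class SplitLineReader and its method are names made up for the example. It assumes a record belongs to the split that contains its first byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitLineReader {
    // Returns the "\n"-delimited records whose first byte lies in [start, end).
    // Assumes 0 <= start <= data.length.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        // If the split begins in the middle of a record, skip to the next record start.
        if (pos > 0 && data[pos - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> records = new ArrayList<>();
        // A record that starts before 'end' is read in full, even if it runs past 'end'.
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        // Offset 8 falls inside "beta", so the first split reads it whole.
        System.out.println(readSplit(data, 0, 8));
        System.out.println(readSplit(data, 8, data.length));
    }
}
```

Because each reader skips a partial leading record and finishes the record that straddles its end, adjacent splits cover every record exactly once regardless of where the offsets fall.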

1 file, 1000 records, 1000 maps requested yields 43 actual maps
1 file, 10,000 records,  10,000 maps requested yields 430 actual maps

I don't understand how this is happening. What are the data size, block size, and minimum split size in your job?

In all of these cases I can only get 2 task/node running at the same
time.... once in a while 3 run.... even though I have specified a higher
number to be allowed.

Are your maps finishing quickly (< 20 seconds)?

I want 1 map per record, from one file, for any number of records, and I want it guaranteed. Later I may want 10 records, or 100, but right now I want to force a one record per mapper relationship, and I do not want
to pay the file creation overhead of, say 1000 files, just to get 1000
maps.

That is completely doable, although to make it perform well you need either an index from row number to file offset or fixed-width records. In any case, you'll need to write your own InputFormat.
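
As a rough sketch of the split arithmetic for both options, in plain Java: RecordSplits, ByteRange, and the method names below are hypothetical, chosen for this illustration, and none of this is Hadoop API. A real InputFormat would wrap each range in its own InputSplit and pair it with a reader that seeks to the offset. The recordsPerSplit parameter is an assumption added to cover the "later I may want 10 records" case.

```java
import java.util.ArrayList;
import java.util.List;

class RecordSplits {
    /** A byte range [offset, offset + length) that one map would read. */
    static final class ByteRange {
        final long offset;
        final long length;

        ByteRange(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Fixed-width records: record i starts at i * recordWidth, so no index is needed. */
    static List<ByteRange> fixedWidthSplits(long fileLength, int recordWidth, int recordsPerSplit) {
        if (recordWidth <= 0 || recordsPerSplit <= 0 || fileLength % recordWidth != 0) {
            throw new IllegalArgumentException("need positive sizes and a whole number of records");
        }
        long chunk = (long) recordWidth * recordsPerSplit;
        List<ByteRange> splits = new ArrayList<>();
        for (long off = 0; off < fileLength; off += chunk) {
            // The last split may hold fewer records than the others.
            splits.add(new ByteRange(off, Math.min(chunk, fileLength - off)));
        }
        return splits;
    }

    /** Variable-width records: recordOffsets[i] is the byte offset of record i, ascending. */
    static List<ByteRange> indexedSplits(long[] recordOffsets, long fileLength, int recordsPerSplit) {
        if (recordsPerSplit <= 0) {
            throw new IllegalArgumentException("recordsPerSplit must be positive");
        }
        List<ByteRange> splits = new ArrayList<>();
        for (int i = 0; i < recordOffsets.length; i += recordsPerSplit) {
            int next = i + recordsPerSplit;
            // A split ends where the next split's first record begins, or at end of file.
            long end = next < recordOffsets.length ? recordOffsets[next] : fileLength;
            splits.add(new ByteRange(recordOffsets[i], end - recordOffsets[i]));
        }
        return splits;
    }
}
```

With recordsPerSplit set to 1, either method yields exactly one range per record, and computing the ranges never requires scanning the data file itself. Without an index or fixed widths, something would have to read the file to find each record start, which is the performance concern.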

-- Owen
