On Oct 18, 2007, at 5:04 PM, Lance Amundsen wrote:
You said arbitrary.. maybe I missed something. Can I construct a
getSplits() method that chunks up the file however I want?
Yes. The application specifies an InputFormat class, which has a
getSplits method that returns a list of InputSplits. The "standard"
input formats extend FileInputFormat, which has the behavior we have
been describing. However, your InputFormat can generate InputSplits
however it wants. For an example of an unusual variation, look at the
RandomWriter example. It creates input splits that aren't based on
any files at all. It just creates a split for each map that it wants.
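
To illustrate the idea of splits that are not tied to any file, here
is a minimal sketch in plain Python rather than the Hadoop Java API.
The names SyntheticSplit and get_splits are hypothetical and chosen
only for this illustration; the point is that the split list can be
built from nothing more than the number of maps wanted.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SyntheticSplit:
    """A split that names a unit of work, not a byte range in a file."""
    map_index: int


def get_splits(num_maps: int) -> List[SyntheticSplit]:
    # One split per desired map task; no file is consulted at all.
    return [SyntheticSplit(i) for i in range(num_maps)]
```

Each split here would drive exactly one map, so asking for 1000 maps
yields 1000 splits regardless of what is on disk.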
I assumed I needed to return a split map that corresponded to key,
value boundaries,
SequenceFileInputFormat and TextInputFormat don't need the splits to
match the record boundaries. They both start at the first record
after the split's start offset and continue to the next record after
the split's end. TextInputFormat always treats "\n" as the record
delimiter, and SequenceFile uses constant blocks of bytes ("sync
markers") to find record boundaries.
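
The boundary rule for newline-delimited text can be sketched as
follows. This is plain Python over an in-memory byte string, not the
actual Hadoop record reader, and read_split is a hypothetical name. It
assumes splits are contiguous byte ranges [start, end) and that a
record belongs to the split in which it begins.

```python
from typing import List


def read_split(data: bytes, start: int, end: int) -> List[bytes]:
    """Return the newline-delimited records owned by the split [start, end)."""
    pos = start
    if start != 0:
        # Skip through the first newline at or after `start`: that
        # record is read by the previous split.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1

    records = []
    # A record that begins at or before `end` is read in full by this
    # split, even if it runs past the split's end.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            records.append(data[pos:])  # last record, no trailing newline
            break
        records.append(data[pos:nl])
        pos = nl + 1
    return records
```

With this rule the split offsets can fall anywhere, including in the
middle of a record, and every record is still read exactly once.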
1 file, 1000 records, 1000 maps requested yields 43 actual maps
1 file, 10,000 records, 10,000 maps requested yields 430 actual maps
I don't understand how this is happening. What is the data size,
block size, and minimum split size in your job?
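
For reference, here is a rough sketch of how those three quantities
and the requested map count can interact. It assumes the split size is
chosen as max(min split size, min(total size / requested maps, block
size)), which is my understanding of the file-based input formats, and
it ignores any rounding slack on the last split. The function names
are hypothetical, and this is an illustration rather than a diagnosis
of the numbers above.

```python
def compute_split_size(total_size: int, requested_maps: int,
                       block_size: int, min_split_size: int = 1) -> int:
    # Goal size: what each split would be if the request were honored exactly.
    goal_size = total_size // max(requested_maps, 1)
    # Assumed rule: never below the minimum, never above one block.
    return max(1, min_split_size, min(goal_size, block_size))


def estimate_num_splits(total_size: int, requested_maps: int,
                        block_size: int, min_split_size: int = 1) -> int:
    split_size = compute_split_size(total_size, requested_maps,
                                    block_size, min_split_size)
    return -(-total_size // split_size)  # ceiling division
```

Under this assumption the requested map count is only a hint: a large
minimum split size or a tiny file can both pull the actual number of
maps well away from the request.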
In all of these cases I can only get 2 task/node running at the same
time.... once in a while 3 run.... even though I have specified a
higher number to be allowed.
Are your maps finishing quickly (< 20 seconds)?
I want 1 map per record, from one file, for any number of records,
and I want it guaranteed. Later I may want 10 records, or 100, but
right now I want to force a one record per mapper relationship, and I
do not want to pay the file creation overhead of, say, 1000 files,
just to get 1000 maps.
That is completely doable, although to make it perform well you need
either an index from row number to file offset or fixed-width
records. In any case, you'll need to write your own InputFormat.
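
As a sketch of the two options, the split boundaries could be computed
along these lines. This is plain Python, not Hadoop code; the function
names are hypothetical, each split is an (offset, length) pair, and
the index is assumed to be an ascending list of the byte offsets at
which each record begins.

```python
from typing import List, Sequence, Tuple

Split = Tuple[int, int]  # (byte offset, length in bytes)


def fixed_width_splits(file_size: int, record_width: int,
                       records_per_split: int = 1) -> List[Split]:
    # With fixed-width records, every boundary is simple arithmetic.
    if record_width <= 0 or file_size % record_width != 0:
        raise ValueError("file size must be a multiple of the record width")
    step = record_width * records_per_split
    return [(off, min(step, file_size - off))
            for off in range(0, file_size, step)]


def indexed_splits(offsets: Sequence[int], file_size: int,
                   records_per_split: int = 1) -> List[Split]:
    # offsets[i] is where record i begins; take every k-th record start.
    starts = list(offsets[::records_per_split])
    ends = starts[1:] + [file_size]
    return [(s, e - s) for s, e in zip(starts, ends)]
```

Either way, moving from 1 record per map to 10 or 100 is just a change
to records_per_split, and no extra files need to be created.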
-- Owen