You said arbitrary... maybe I missed something. Can I construct a
getSplits() method that chunks up the file however I want? I assumed I
needed to return a split map that corresponded to key/value boundaries,
which I am having trouble doing since the input format for the value can
change (I am now using ObjectWritable, for example... GenericWritable makes
this impossible, I think). Then there is the file header to take into
account... but if I am making this too complicated, let me know.
Setting the number of maps as a method of forcing record/file/mapper
relationships appears dicey at best. This is the sort of thing I am seeing
on my 9-node setup:

1 file, 1,000 records, 1,000 maps requested yields 43 actual maps
1 file, 10,000 records, 10,000 maps requested yields 430 actual maps

In all of these cases I can only get 2 tasks/node running at the same
time... once in a while 3 run... even though I have configured a higher
number to be allowed.
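The gap between requested and actual maps follows from how FileInputFormat's getSplits() clamps the split size: the requested map count only sets a goal size, which is then bounded below by the minimum split size. A simplified sketch of that arithmetic (the concrete file and minimum-split-size values below are illustrative assumptions, not your cluster's actual settings; the real method also has a slop factor this omits):

```java
// Simplified sketch of the split-size arithmetic in Hadoop's old
// FileInputFormat.getSplits(). The requested number of maps is only a
// hint: it sets goalSize, which is then clamped by minSize and blockSize.
public class SplitMath {
    public static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static long numActualSplits(long totalSize, int requestedMaps,
                                       long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(requestedMaps, 1);
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        // ceiling division: any remaining tail of bytes becomes a split
        return (totalSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        // Illustrative only: a 64 MB file with an effective minimum split
        // size of 1.5 MB yields 43 splits no matter how many maps you ask for.
        System.out.println(numActualSplits(64L << 20, 1000, 3L << 19, 64L << 20));
        // prints 43
    }
}
```

The practical consequence is that asking for more maps can never push the split size below the configured minimum, which is why the requested count is silently scaled down.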
I want 1 map per record, from one file, for any number of records, and I
want it guaranteed. Later I may want 10 records, or 100, but right now I
want to force a one-record-per-mapper relationship, and I do not want to
pay the file creation overhead of, say, 1000 files just to get 1000 maps.
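Hadoop won't guarantee that relationship on its own, but a custom getSplits() can, if you keep a side index of record start offsets (written when the file is created) and emit one split per record, or per N records. A minimal sketch of just the split computation, with a hypothetical RecordSplit value type standing in for Hadoop's FileSplit:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derive one split per record (or per N records)
// from an index of record start offsets kept alongside the data file.
public class PerRecordSplitter {
    // Stand-in for Hadoop's FileSplit: a (start, length) byte range.
    public static final class RecordSplit {
        public final long start, length;
        RecordSplit(long start, long length) { this.start = start; this.length = length; }
    }

    // offsets[i] is the byte offset of record i; fileLength bounds the last
    // record. With recordsPerSplit = 1 this yields exactly one map per record.
    public static List<RecordSplit> getSplits(long[] offsets, long fileLength,
                                              int recordsPerSplit) {
        List<RecordSplit> splits = new ArrayList<>();
        for (int i = 0; i < offsets.length; i += recordsPerSplit) {
            int next = Math.min(i + recordsPerSplit, offsets.length);
            long end = (next < offsets.length) ? offsets[next] : fileLength;
            splits.add(new RecordSplit(offsets[i], end - offsets[i]));
        }
        return splits;
    }
}
```

The file header drops out naturally here: the first indexed offset simply starts after it, so no split ever covers it. Changing recordsPerSplit later gives you the 10- or 100-record groupings without touching the data file.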
BTW, I have started working on my own InputFormat and InputFileClasses as
well.... and these questions are helping me with context, but tactically I
am just trying to understand the Hadoop mapper startup overhead with the
goals of a) reducing it, and b) making the overhead stay constant (flat)
out to n mappers on m nodes.
Lance
IBM Software Group - Strategy
Performance Architect
High-Performance On Demand Solutions (HiPODS)
650-678-8425 cell
Doug Cutting <[EMAIL PROTECTED]>
To: [email protected]
Date: 10/18/2007 04:04 PM
Subject: Re: InputFiles, Splits, Maps, Tasks
Please respond to: [EMAIL PROTECTED]
Lance Amundsen wrote:
> Thx, I'll give that a try. Seems to me a method to tell hadoop to split a
> file every "n" key/value pairs would be logical. Or maybe a
> createSplitBoundary when appending key/value records?
Splits should not require examining the data: that's not scalable. So
they're instead on arbitrary byte boundaries.
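Arbitrary byte boundaries stay correct because of a reader-side convention (this is how TextInputFormat handles newline-delimited text): each reader skips forward to the first record start at or after its split's start offset, and reads past its split's end to finish the last record it began. A small sketch of the resulting invariant, assuming newline-delimited records: the function counts records whose first byte falls inside a byte range, and however you place the boundary, every record is counted exactly once across the splits.

```java
// Sketch of why byte-boundary splitting loses and duplicates nothing:
// a record "belongs" to the split containing its first byte, so each
// record is processed by exactly one map regardless of boundary placement.
// Newline-delimited records are assumed for simplicity.
public class ByteBoundaryDemo {
    // Count records whose start byte lies in [start, end); a record starts
    // at offset 0 or immediately after a '\n'.
    public static int recordsInSplit(byte[] data, long start, long end) {
        int count = 0;
        for (long i = start; i < end; i++) {
            boolean atRecordStart = (i == 0) || data[(int) (i - 1)] == '\n';
            if (atRecordStart) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbbb\ncc\n".getBytes();   // 3 records
        long mid = 5;                                // boundary inside "bbbb"
        // The two splits together still see exactly 3 records.
        System.out.println(recordsInSplit(data, 0, mid)
                + recordsInSplit(data, mid, data.length));
        // prints 3
    }
}
```

This is what makes split computation independent of the data: the InputFormat only does arithmetic on byte offsets, and the RecordReader pays the small cost of resynchronizing at read time.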
> I just want a way, and not a real complex way, of directing the # of maps
> and the breakdown of records going to them. Creating a separate file per
> record group is too slow for my purposes.
Just set the number of map tasks. That should mostly do what you want
in this case. If you want finer-grained control, implement your own
InputFormat.
Doug