You said arbitrary... maybe I missed something. Can I construct a getSplits() method that chunks up the file however I want? I assumed I needed to return a split map that corresponded to key/value boundaries, which I am having trouble doing since the input format for the value can change (I am now using ObjectWritable, for example; GenericWritable makes this impossible, I think). Then there is the file header to take into account... but if I am making this too complicated, let me know.

Setting the number of maps as a way to force record/file/mapper relationships appears dicey at best. This is the sort of thing I am seeing in my 9-node setup:

1 file, 1000 records, 1000 maps requested yields 43 actual maps
1 file, 10,000 records, 10,000 maps requested yields 430 actual maps

In all of these cases I can only get 2 tasks/node running at the same time (once in a while 3 run), even though I have specified a higher number to be allowed.

I want 1 map per record, from one file, for any number of records, and I want it guaranteed. Later I may want 10 records, or 100, but right now I want to force a one-record-per-mapper relationship, and I do not want to pay the file creation overhead of, say, 1000 files just to get 1000 maps.

BTW, I have started working on my own InputFormat and InputFileClasses as well, and these questions are helping me with context, but tactically I am just trying to understand the Hadoop mapper startup overhead, with the goals of a) reducing it, and b) making the overhead stay constant (flat) out to n mappers on m nodes.

Lance

IBM Software Group - Strategy
Performance Architect
High-Performance On Demand Solutions (HiPODS)
650-678-8425 cell

From: Doug Cutting <[EMAIL PROTECTED]>
To: hadoop-user@lucene.apache.org
Date: 10/18/2007 04:04 PM
Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
Please respond to: [EMAIL PROTECTED]

Lance Amundsen wrote:
> Thx, I'll give that a try. Seems to me a method to tell hadoop to split a
> file every "n" key/value pairs would be logical. Or maybe a
> createSplitBoundary when appending key/value records?

Splits should not require examining the data: that's not scalable. So they're instead on arbitrary byte boundaries.

> I just want a way, and not a real complex way, of directing the # of maps
> and the breakdown of records going to them. Creating a separate file per
> record group is too slow for my purposes.

Just set the number of map tasks.
That should mostly do what you want in this case. If you want finer-grained control, implement your own InputFormat.

Doug
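
For readers following the thread, here is a rough sketch of the two splitting strategies being discussed. It is plain Java and deliberately does not use the Hadoop API: SplitSketch, ByteRange, byteSplits, recordsFor and oneSplitPerRecord are all hypothetical names made up for illustration, and the record start offsets are assumed to be known up front (for example from a side index), which a real record reader would instead have to discover itself. The first method cuts the file on arbitrary byte boundaries without looking at the data; the second shows the usual convention that a record belongs to the split containing its first byte, so a file header or a record straddling a boundary is not a problem; the third shows what a one-record-per-split layout would need.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration only; none of these names come from Hadoop. */
public class SplitSketch {

    /** Hypothetical stand-in for a file split: the byte range [start, end). */
    static final class ByteRange {
        final long start;
        final long end;
        ByteRange(long start, long end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + ", " + end + ")"; }
    }

    /** Cut the file into at most numSplits ranges without examining the data. */
    static List<ByteRange> byteSplits(long fileLength, int numSplits) {
        if (numSplits <= 0) throw new IllegalArgumentException("numSplits must be positive");
        List<ByteRange> splits = new ArrayList<>();
        long chunk = (fileLength + numSplits - 1) / numSplits; // ceiling division
        for (long start = 0; start < fileLength; start += chunk) {
            splits.add(new ByteRange(start, Math.min(start + chunk, fileLength)));
        }
        return splits;
    }

    /** A record is owned by the split that contains its first byte. */
    static List<Integer> recordsFor(ByteRange split, long[] recordStarts) {
        List<Integer> owned = new ArrayList<>();
        for (int i = 0; i < recordStarts.length; i++) {
            if (recordStarts[i] >= split.start && recordStarts[i] < split.end) {
                owned.add(i);
            }
        }
        return owned;
    }

    /** One range per record; assumes sorted record offsets are known in advance. */
    static List<ByteRange> oneSplitPerRecord(long[] recordStarts, long fileLength) {
        List<ByteRange> splits = new ArrayList<>();
        for (int i = 0; i < recordStarts.length; i++) {
            long end = (i + 1 < recordStarts.length) ? recordStarts[i + 1] : fileLength;
            splits.add(new ByteRange(recordStarts[i], end));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up layout: a 16-byte header followed by four variable-length records.
        long[] recordStarts = {16, 40, 75, 90};
        long fileLength = 120;

        for (ByteRange split : byteSplits(fileLength, 3)) {
            System.out.println(split + " owns records " + recordsFor(split, recordStarts));
        }
        System.out.println("one per record: " + oneSplitPerRecord(recordStarts, fileLength));
    }
}
```

The trade-off is the one Doug points at: byte-boundary splits need no pass over the data but give no control over how many records each map sees, while one split per record gives exact control at the cost of knowing every record offset before the job starts.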