For right now, I am testing boundary conditions related to startup costs. I want to build a mapper interface whose performance stays relatively flat with respect to the number of mappers. My goal is to dramatically improve the startup cost for one mapper, and then make sure that startup cost does not increase dramatically as nodes, maps, and records are increased.
Example: let's say I have 10K one-second jobs and I want the whole thing to run in 2 seconds. I currently see no way for Hadoop to achieve this, but I also see how to get there, and this level of granularity would be one of the requirements... I believe.

Lance
IBM Software Group - Strategy
Performance Architect, High-Performance On Demand Solutions (HiPODS)
650-678-8425 cell

From: Ted Dunning <[EMAIL PROTECTED]>
To: <hadoop-user@lucene.apache.org>
Date: 10/17/2007 11:34 AM
Subject: Re: InputFiles, Splits, Maps, Tasks

On 10/17/07 10:37 AM, "Lance Amundsen" <[EMAIL PROTECTED]> wrote:

> 1 file per map, 1 record per file, isSplitable(true or false): yields 1 record per mapper

Yes.

> 1 file total, n records, isSplitable(true): yields a variable number of records to a variable number m of mappers

Yes.

> 1 file total, n records, isSplitable(false): yields n records into 1 mapper

Yes.

> What I am immediately looking for is a way to do:
>
> 1 file total, n records, isSplitable(true): yields 1 record into n mappers
>
> But ultimately I need to fully control the file/record distributions.

Why in the world do you need this level of control? Isn't that the point of frameworks like Hadoop? (to avoid the need for this)
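For context on the cases discussed above: in the old org.apache.hadoop.mapred API, the framework calls an InputFormat's getSplits() to decide how many map tasks to launch and getRecordReader() to decide which records each task sees. The "1 file total, n records, 1 record into n mappers" case can be approximated with a custom InputFormat that emits one FileSplit per line. The sketch below is a minimal illustration under that assumption; the class names (OneLinePerSplitInputFormat, SingleLineRecordReader) are hypothetical and not part of Hadoop, and this is not presented as the poster's actual implementation.

```java
// Minimal sketch, assuming the old org.apache.hadoop.mapred API.
// Produces one split (and therefore one map task) per line of each input file.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.LineReader;

public class OneLinePerSplitInputFormat extends FileInputFormat<LongWritable, Text> {

  // Scan each input file once, noting where every line starts, and emit a
  // FileSplit that covers exactly that one line.
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (Path path : getInputPaths(job)) {
      FileSystem fs = path.getFileSystem(job);
      FSDataInputStream in = fs.open(path);
      LineReader reader = new LineReader(in, job);
      Text line = new Text();
      long begin = 0;
      long length;
      while ((length = reader.readLine(line)) > 0) {
        splits.add(new FileSplit(path, begin, length, (String[]) null));
        begin += length;
      }
      in.close();
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new SingleLineRecordReader((FileSplit) split, job);
  }

  // Reads exactly the one line covered by the split, then reports end of input.
  static class SingleLineRecordReader implements RecordReader<LongWritable, Text> {
    private final FileSplit split;
    private final FSDataInputStream in;
    private final LineReader reader;
    private boolean done = false;

    SingleLineRecordReader(FileSplit split, JobConf job) throws IOException {
      this.split = split;
      FileSystem fs = split.getPath().getFileSystem(job);
      in = fs.open(split.getPath());
      in.seek(split.getStart());          // each split begins at a line boundary
      reader = new LineReader(in, job);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      if (done) {
        return false;
      }
      key.set(split.getStart());          // key = byte offset of the line
      reader.readLine(value);             // value = the single line in this split
      done = true;
      return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue() { return new Text(); }
    public long getPos() throws IOException { return in.getPos(); }
    public float getProgress() { return done ? 1.0f : 0.0f; }
    public void close() throws IOException { in.close(); }
  }
}
```

Note that 10,000 single-record splits still mean 10,000 task startups, so a split-per-record InputFormat by itself does not get a 10K x 1-second workload down to 2 seconds; it only controls the record-to-mapper distribution. Later Hadoop releases ship org.apache.hadoop.mapred.lib.NLineInputFormat, which provides essentially this behavior (N lines per split) out of the box.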