For right now, I am testing boundary conditions related to startup costs.
I want to build a mapper interface that performs relatively flat with
respect to the number of mappers. My goal is to dramatically improve the
startup cost for one mapper, and then make sure that startup cost does not
increase dramatically as nodes, maps, and records are increased.
Example: let's say I have 10K one-second jobs and I want the whole thing to
run in 2 seconds. I currently see no way for Hadoop to achieve this, but I
do see how to get there, and this level of granularity would be one of the
requirements, I believe.
Lance
IBM Software Group - Strategy
Performance Architect
High-Performance On Demand Solutions (HiPODS)
650-678-8425 cell
Ted Dunning <[EMAIL PROTECTED]m> wrote on 10/17/2007 11:34 AM
To: <[email protected]>
Subject: Re: InputFiles, Splits, Maps, Tasks
Please respond to [EMAIL PROTECTED]e.apache.org
On 10/17/07 10:37 AM, "Lance Amundsen" <[EMAIL PROTECTED]> wrote:
> 1 file per map, 1 record per file, isSplitable(true or false): yields 1
> record per mapper
Yes.
> 1 file total, n records, isSplitable(true): Yields variable n records per
> variable m mappers
Yes.
> 1 file total, n records, isSplitable(false): Yields n records into 1
> mapper
Yes.
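(For reference, a minimal sketch of that unsplittable case against the old
org.apache.hadoop.mapred API; the class name is my own and the code is an
untested illustration:)

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  public class UnsplittableTextInputFormat extends TextInputFormat {
    // Returning false keeps each input file in a single split, so all n
    // records of that file are handled by one mapper.
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }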
> What I am immediately looking for is a way to do:
>
> 1 file total, n records, isSplitable(true): Yields 1 record into n mappers
>
> But ultimately I need to fully control the file/record distributions.
Why in the world do you need this level of control? Isn't that the point of
frameworks like Hadoop? (to avoid the need for this)
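(For what it's worth, one way to approximate the "1 record into n mappers"
case is a custom InputFormat whose getSplits() emits one split per line, so
each record becomes its own map task. A rough, untested sketch against the
old org.apache.hadoop.mapred API; the class name and details are
illustrative only:)

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;

  public class OneLinePerSplitInputFormat extends TextInputFormat {

    // Emit one FileSplit per line; the default line record reader then
    // gives each mapper a single record.
    public InputSplit[] getSplits(JobConf job, int numSplits)
        throws IOException {
      List<InputSplit> splits = new ArrayList<InputSplit>();
      for (Path file : getInputPaths(job)) {
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream in = fs.open(file);
        long start = 0, pos = 0;
        int b;
        while ((b = in.read()) != -1) {
          pos++;
          if (b == '\n') {                 // end of one record
            splits.add(new FileSplit(file, start, pos - start, new String[0]));
            start = pos;
          }
        }
        if (pos > start) {                 // last record without a newline
          splits.add(new FileSplit(file, start, pos - start, new String[0]));
        }
        in.close();
      }
      return splits.toArray(new InputSplit[splits.size()]);
    }
  }

Note that each of those splits still pays the full per-task startup cost,
which is exactly the overhead in question here.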