Ahhh...

There is an old saying for this.  I think you are picking fly specks out of
pepper.

Unless your input format is very, very strange, doing the split again for
the second job does, indeed, lead to some small inefficiency, but this cost
should be so low compared to other inefficiencies that you are wasting your
time trying to optimize it away.  Remember, you don't know where the maps
will execute, so getting each split to the correct node will be a nightmare.

If your splitting is actually so expensive that you can measure it, then you
should consider changing formats.  This is analogous to having a single
gzipped input file.  Splitting such a file involves reading the file from
the beginning because gzip is a stream compression algorithm.  There are a
few proposals going around to optimize that by concatenating gzip files with
special marker files in between, but the real answer is to either not use
gzipped input files or to split the files before gzipping.
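
To make the "split before gzipping" option concrete, here is a minimal
Python sketch.  It is not Hadoop code, and the function names
(gzip_in_chunks, read_chunk) and the record format are made up for
illustration.  The point is only that each piece is its own gzip stream,
so a reader can open any piece without reading the ones before it.

```python
import gzip

def gzip_in_chunks(lines, lines_per_chunk):
    """Compress lines as independent gzip streams, one per chunk.

    Hypothetical helper: each returned blob is a complete gzip stream.
    """
    chunks = []
    for start in range(0, len(lines), lines_per_chunk):
        block = "".join(lines[start:start + lines_per_chunk]).encode("utf-8")
        chunks.append(gzip.compress(block))
    return chunks

def read_chunk(chunks, index):
    """Decompress a single chunk without touching any of the others."""
    text = gzip.decompress(chunks[index]).decode("utf-8")
    return text.splitlines(keepends=True)

# Ten dummy records, four per chunk, gives three independent gzip streams.
records = [f"<record id='{i}'/>\n" for i in range(10)]
chunks = gzip_in_chunks(records, lines_per_chunk=4)

# The last chunk can be read directly; chunks 0 and 1 are never opened.
last = read_chunk(chunks, 2)
```

With a single gzipped file there is no equivalent of read_chunk(chunks, 2):
to reach the same records you would have to decompress everything that
comes before them.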


On 3/12/08 10:38 AM, "Prasan Ary" <[EMAIL PROTECTED]> wrote:

>   So basically I am splitting the XML twice, and getting same split each time.
> It would be nice if I could split the XML once, and send those splits to Map
> of Job_1 and Job_2.
