Ahhh... There is an old saying for this. I think you are pulling fly specks out of pepper.
Unless your input format is very, very strange, doing the split again for two jobs does, indeed, lead to some small inefficiency, but this cost should be so low compared to other inefficiencies that you are wasting your time trying to optimize it away. Remember, you don't know where the maps will execute, so getting the splits to the correct nodes would be a nightmare. If your splitting is actually so expensive that you can measure it, then you should consider changing formats.

This is analogous to having a single gzipped input file. Splitting such a file involves reading it from the beginning because gzip is a stream compression algorithm. There are a few proposals going around to optimize that by concatenating gzip files with special marker files in between, but the real answer is either not to use gzipped input files or to split the files before gzipping.

On 3/12/08 10:38 AM, "Prasan Ary" <[EMAIL PROTECTED]> wrote:

> So basically I am splitting the XML twice, and getting same split each time.
> It would be nice if I could split the XML once, and send those splits to Map
> of Job_1 and Job_2.
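
To make that concrete, here is a rough sketch (old org.apache.hadoop.mapred API) of what I mean: point both jobs at the same input path and let each job compute its own splits. The class name, paths, and identity mappers below are just placeholders for your real logic, and depending on your Hadoop version the input/output path calls may live on JobConf instead of FileInputFormat/FileOutputFormat.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class TwoJobDriver {
  public static void main(String[] args) throws Exception {
    // Same input path for both jobs; each job recomputes its own splits,
    // which is cheap for any reasonable input format.
    Path input = new Path("/user/prasan/xml-input");   // made-up path

    JobConf job1 = new JobConf(TwoJobDriver.class);
    job1.setJobName("job-1");
    job1.setInputFormat(TextInputFormat.class);         // splits computed at submit time
    job1.setMapperClass(IdentityMapper.class);          // your first map logic goes here
    job1.setOutputKeyClass(LongWritable.class);
    job1.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job1, input);
    FileOutputFormat.setOutputPath(job1, new Path("/user/prasan/out-1"));
    JobClient.runJob(job1);

    JobConf job2 = new JobConf(TwoJobDriver.class);
    job2.setJobName("job-2");
    job2.setInputFormat(TextInputFormat.class);         // recomputed here, again cheaply
    job2.setMapperClass(IdentityMapper.class);          // your second map logic goes here
    job2.setOutputKeyClass(LongWritable.class);
    job2.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job2, input);
    FileOutputFormat.setOutputPath(job2, new Path("/user/prasan/out-2"));
    JobClient.runJob(job2);
  }
}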
