Is there a requirement for the final reduced file to be sorted? If not, wouldn't a map-only job (plus a combiner) followed by a merge-only job provide the answer?
Raj

>________________________________
> From: Michael Segel <michael_se...@hotmail.com>
>To: common-user@hadoop.apache.org
>Sent: Tuesday, July 31, 2012 5:24 AM
>Subject: Re: Merge Reducers Output
>
>You really don't want to run a single reducer unless you know that you
>don't have a lot of mappers.
>
>As long as the output data types and structure are the same as the input,
>you can run your code as the combiner, and then run it again as the
>reducer. Problem solved with one or two lines of code.
>If your input and output don't match, then you can use the existing code
>as a combiner, and then write a new reducer. It could just as easily be an
>identity reducer too. (I don't know the exact problem.)
>
>So here's a silly question: why wouldn't you want to run a combiner?
>
>
>On Jul 31, 2012, at 12:08 AM, Jay Vyas <jayunit...@gmail.com> wrote:
>
>> It's not clear to me that you need custom input formats...
>>
>> 1) Getmerge might work, or
>>
>> 2) Simply run a SINGLE-reducer job (have the mappers output a static
>> final int key = 1, or specify numReducers = 1).
>>
>> In this case, only one reducer will be called, and it will read through
>> all the values.
>>
>> On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS <bejoy.had...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Why not use 'hadoop fs -getmerge <outputFolderInHdfs>
>>> <targetFileNameInLfs>' while copying files out of HDFS for the end
>>> users to consume? This will merge all the files in 'outputFolderInHdfs'
>>> into one file and put it in the local file system.
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from handheld, please excuse typos.
>>>
>>> -----Original Message-----
>>> From: Michael Segel <michael_se...@hotmail.com>
>>> Date: Mon, 30 Jul 2012 21:08:22
>>> To: <common-user@hadoop.apache.org>
>>> Reply-To: common-user@hadoop.apache.org
>>> Subject: Re: Merge Reducers Output
>>>
>>> Why not use a combiner?
>>>
>>> On Jul 30, 2012, at 7:59 PM, Mike S wrote:
>>>
>>>> As has been asked several times before, I need to merge my reducers'
>>>> output files. Imagine I have many reducers which will generate 200
>>>> files. To merge them together, I have written another MapReduce job
>>>> where each mapper reads a complete file fully into memory and outputs
>>>> it, and then a single reducer merges everything together. To do so, I
>>>> had to write a custom FileInputFormat that reads a complete file into
>>>> memory, and another custom FileOutputFormat that appends each record's
>>>> bytes together. This is how my mapper and reducer look:
>>>>
>>>> public static class MapClass extends Mapper<NullWritable,
>>>> BytesWritable, NullWritable, BytesWritable>
>>>> {
>>>>     @Override
>>>>     public void map(NullWritable key, BytesWritable value,
>>>>             Context context) throws IOException, InterruptedException
>>>>     {
>>>>         // Pass the whole file's bytes straight through.
>>>>         context.write(key, value);
>>>>     }
>>>> }
>>>>
>>>> public static class Reduce extends Reducer<NullWritable,
>>>> BytesWritable, NullWritable, BytesWritable>
>>>> {
>>>>     @Override
>>>>     public void reduce(NullWritable key, Iterable<BytesWritable> values,
>>>>             Context context) throws IOException, InterruptedException
>>>>     {
>>>>         // Concatenate every file's bytes into the single output.
>>>>         for (BytesWritable value : values)
>>>>         {
>>>>             context.write(NullWritable.get(), value);
>>>>         }
>>>>     }
>>>> }
>>>>
>>>> I still have to have one reducer, and that is a bottleneck. Please
>>>> note that I must do this merging, as the users of my MR job are
>>>> outside my Hadoop environment and need the result as one file.
>>>>
>>>> Is there a better way to merge the reducers' output files?
>>>>
>>>
>>>
>>
>>
>> --
>> Jay Vyas
>> MMSB/UCHC
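For reference, the combiner and single-reducer suggestions in the thread amount to a few lines in the job driver. Below is a minimal sketch, not code from the thread itself: it assumes the MapClass and Reduce classes from Mike's post, and WholeFileInputFormat is a hypothetical stand-in for the custom whole-file input format he describes but does not include.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "merge-output-files");
        job.setJarByClass(MergeJobDriver.class);

        // Hypothetical stand-in for the custom input format described in
        // the thread (reads one whole file per record).
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setMapperClass(MapClass.class);

        // Michael's point: when the reduce logic's input and output types
        // match, the same class can be registered as the combiner as well.
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        // Jay's option 2: one reducer yields one output file (part-r-00000).
        job.setNumReduceTasks(1);

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(BytesWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that registering a combiner is only safe when the reduce logic is associative and its input and output key/value types match, exactly the caveat Michael raises above; here the Reduce class simply re-emits values, so it is harmless as a combiner.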
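Bejoy's 'hadoop fs -getmerge' approach also has a programmatic equivalent, FileUtil.copyMerge in the Hadoop FileSystem API, which is useful if the merge has to happen inside a Java tool rather than from the shell. A sketch; the two paths are placeholders, not paths from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class GetMergeEquivalent {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        // Concatenates every file under the HDFS output directory into a
        // single local file, like 'hadoop fs -getmerge <outputFolderInHdfs>
        // <targetFileNameInLfs>'.
        FileUtil.copyMerge(hdfs, new Path("/user/mike/output"),    // placeholder
                           local, new Path("/tmp/merged-result"),  // placeholder
                           false,  // do not delete the HDFS source files
                           conf,
                           null);  // no separator string between files
    }
}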