Thanks for the prompt response, Harsh!

The job is an indexing job. Each Mapper emits a small index and the Reducer merges all of those indexes together. The Mappers output the index as a Writable, which serializes it. I guess I could write the Reducer's function as a separate class as you suggest, but then I'd need to write a custom OutputFormat that puts those indexes on HDFS or somewhere?
That complicates matters for me -- currently, when this job is run as part of a sequence of jobs, I can guarantee that if the job succeeds, then the indexes are successfully merged, and if it fails, the job should be restarted. While that can be achieved with a separate FS-using program as you suggest, it makes the flow harder to manage.

Is my scenario that extreme? Would you say the common scenario for Hadoop is jobs that output tiny objects between Mappers and Reducers?

Would this work much better if I worked w/ several Reducers? I'm not sure it will, because the problem lies, IMO, in Hadoop allocating large consecutive chunks of RAM in my case, instead of trying to either stream the data or break it down into smaller chunks.

Is there absolutely no way to bypass the shuffle + sort phases? I don't mind writing some classes if that's what it takes ...

Shai

On Thu, Apr 14, 2011 at 9:50 PM, Harsh J <[email protected]> wrote:
> Hello Shai,
>
> On Fri, Apr 15, 2011 at 12:01 AM, Shai Erera <[email protected]> wrote:
> > Hi
> > I'm running on Hadoop 0.20.2 and I have a job with the following nature:
> > * Mapper outputs very large records (50 to 200 MB)
> > * Reducer (single) merges all those records together
> > * Map output key is a constant (could be a NullWritable, but currently it's
> > a LongWritable(1))
> > * Reducer doesn't care about the keys at all
>
> If I understand right, your single reducer's only work is to merge
> your multiple map's large record emits, and nothing else (It does not
> have 'keys' to worry about), correct?
>
> Why not do this with a normal FS-using program that opens a single
> file to write out map-materialized output files from a Map-only job to
> merge them?
>
> --
> Harsh J
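
For reference, the FS-based merge suggested in the quoted reply could look roughly like the sketch below. It is only an illustration under several assumptions: it uses the local filesystem through java.nio rather than HDFS, it assumes the Map-only job left its outputs as part-* files in one directory, and it stands in plain byte concatenation for the real index merge. The class and method names (PartFileMerger, mergeParts) are hypothetical. The relevant point is that each file is streamed through a small buffer, so no 50 to 200 MB record has to sit in RAM as one consecutive chunk.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: not part of Hadoop, and not the actual index merger.
class PartFileMerger {

    /**
     * Streams every part-* file found directly in dir into the single file out,
     * in file-name order, and returns the total number of bytes copied.
     * Assumption: concatenation is a stand-in for the real merge logic.
     */
    static long mergeParts(Path dir, Path out) throws IOException {
        // Collect the map output files first so they can be processed in a stable order.
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts);

        long total = 0;
        try (OutputStream os = Files.newOutputStream(out)) {
            for (Path part : parts) {
                // Files.copy streams the file through an internal buffer
                // instead of loading the whole file into memory.
                total += Files.copy(part, os);
            }
        }
        return total;
    }
}
```

A driver would call PartFileMerger.mergeParts(outputDir, mergedFile) once the Map-only job reports success and treat any IOException as a failure of the whole step, which keeps the "succeeds means merged, fails means restart" guarantee at the level of the job sequence.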
