One technique I use is to set the output directory to a "tmp"
directory; after the job completes, move the files, under new names,
to the actual job output directory and delete the "tmp" directory.
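As a local-filesystem sketch of that move-and-rename step (on HDFS the equivalent calls would be FileSystem.rename() and FileSystem.delete(); the directory layout and naming scheme here are made-up assumptions for illustration):

```java
import java.io.IOException;
import java.nio.file.*;

public class TmpDirMove {
    // Move every file from tmpDir into outDir under a new name, then drop tmpDir.
    static void promoteOutputs(Path tmpDir, Path outDir, String jobName)
            throws IOException {
        Files.createDirectories(outDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tmpDir)) {
            for (Path f : files) {
                // Prefix the job name so two jobs can share one output directory.
                Path target = outDir.resolve(jobName + "_" + f.getFileName());
                Files.move(f, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        Files.delete(tmpDir); // tmpDir is empty at this point
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("job_tmp");
        Path out = Files.createTempDirectory("job_out");
        Files.write(tmp.resolve("part-00000"), "hello".getBytes());
        promoteOutputs(tmp, out, "jobA");
        System.out.println(Files.exists(out.resolve("jobA_part-00000"))); // true
        System.out.println(Files.exists(tmp)); // false
    }
}
```

The same pattern on HDFS also avoids readers seeing half-written part files, since the rename only happens after the job has finished.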
Another technique is, instead of using the OutputCollector, to
initialize and use your own output writers in the Reducer.
I use this a lot, since I need to collect different keys
in different files depending upon their value prefix
(fixed in the Map phase for identification in the Reduce phase).
Here is some sample code from a Reducer for illustration:
private Map<String, RecordWriter<Text, Text>> writers =
    new HashMap<String, RecordWriter<Text, Text>>();

private void initOutputWriters(Text key, Reporter reporter) {
  try {
    int numReduceTasks = this.job.getNumReduceTasks();
    // Derive a stable, non-negative index from the key.
    int index = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    for (String writerPath : JobRunner.properties) {
      RecordWriter<Text, Text> writer = getTextOutputWriter(
          new Path(writerPath), writerPath + "_" + index + ".log", reporter);
      writers.put(writerPath, writer);
    }
    initialized = true;
  } catch (IOException ioe) {
    LOG.error("Unable to initialize writers: " + ioe);
  }
}

public void reduce(Text key, Iterator<Text> values,
    OutputCollector<Text, Text> output, Reporter reporter)
    throws IOException {
  if (!initialized) {
    initOutputWriters(key, reporter);
  }
  String skey = key.toString().toLowerCase();
  String prefix = getPrefix(skey);
  RecordWriter<Text, Text> outputWriter = writers.get(prefix);
  ...
  // do some computation on the keys and values to produce `value`
  ...
  outputWriter.write(key, value);
}
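Stripped of the Hadoop types, the prefix-routing idea above reduces to keeping a map from key prefix to writer and looking up the right writer per record. A minimal sketch, assuming the prefix is everything before the first underscore (the delimiter and class name here are made up for illustration):

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

public class PrefixRouter {
    private final Map<String, Writer> writers = new HashMap<String, Writer>();

    // Assumption: the map phase tagged each key as "<prefix>_<rest>".
    static String getPrefix(String key) {
        int i = key.indexOf('_');
        return i < 0 ? key : key.substring(0, i);
    }

    void register(String prefix, Writer w) {
        writers.put(prefix, w);
    }

    // Route each record to the writer registered for its key's prefix.
    void write(String key, String value) throws IOException {
        Writer w = writers.get(getPrefix(key.toLowerCase()));
        w.write(key + "\t" + value + "\n");
    }

    public static void main(String[] args) throws IOException {
        PrefixRouter r = new PrefixRouter();
        StringWriter errors = new StringWriter();
        StringWriter info = new StringWriter();
        r.register("err", errors);
        r.register("info", info);
        r.write("err_disk", "full");
        r.write("info_start", "ok");
        System.out.print(errors); // err_disk	full
        System.out.print(info);   // info_start	ok
    }
}
```

In the real Reducer, remember to close each RecordWriter in close(), otherwise buffered records at the end of the task can be lost.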
-----Original Message-----
From: Amar Kamat [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 21, 2008 10:42 AM
To: [email protected]
Subject: Re: Output filename generation?
pi song wrote:
> Dear hadoop mailling-list,
>
> Is there a way to control output filename generation? A sample use
> case is when I want 2 MapReduce jobs to output to the same directory.
>
I think you need to write your own output format (see
http://tinyurl.com/4aszgk). Look at OutputFormat.getRecordWriter(). The
parameter *name* is what determines the output filename. One easy way
would be to append the job-name to this *name* in
OutputFormat.getRecordWriter().
Something like

public RecordWriter<WritableComparable, Writable> getRecordWriter(
    FileSystem ignored, JobConf job, String name, Progressable progress)
    throws IOException {
  name = name + "_" + job.getJobName();
  // rest of the code .. taken from Hadoop-0.16.3
}

Amar
> Pi
>
>