I have a slightly modified TextOutputFormat that essentially writes each key 
into its own file. It relies on the premise that my reducer is an identity 
function and that it emits each record one-by-one, in the order the records 
arrive in the collection. Because the records come out of the reducer in 
order, I can keep a single output file open and close it whenever a new key 
appears. The reason I am doing it this way instead of using MultipleOutputs is 
that I am locked into Hadoop 0.20.205.0.
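
For reference, the reducer side is just an identity pass along these lines (a 
sketch using the new-API Reducer, with Text standing in for my actual key and 
value types):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the identity reducer; Text stands in for the job's real types.
public class PassThroughReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // The framework hands me values already grouped by key, so I just
    // emit each record one-by-one in the order it arrives.
    for (Text value : values) {
      context.write(key, value);
    }
  }
}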

The problem I am having is that I sporadically get IOExceptions from trying to 
create a file that already exists. There are two ways I can imagine this 
happening: (1) Reducer 1 emits a record for key A and then Reducer 2 emits a 
record for key A. I am certain this is not the case, since all records for a 
given key should group together in a single reducer. (2) The records are 
emitted out of order from a single reducer (AAAA BBBB A), in which case the 
writer would try to open the file for A a second time.
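
To rule out (2), one thing I could do is have the writer remember every key it 
has already finished and fail fast the moment a key reappears. A rough 
fragment; the seenKeys field and checkKeyOrder() helper are hypothetical, 
would live inside the output format, and checkKeyOrder(k.toString()) would be 
called at the top of write():

// Hypothetical diagnostic: remember every key whose file has been closed
// and fail fast if one of them shows up again.
private final java.util.Set<String> seenKeys =
    new java.util.HashSet<String>();

private void checkKeyOrder(String key) throws IOException {
  // add() returns false when the key was already present, i.e. records
  // for that key stopped and then started again (the AAAA BBBB A case).
  if (!key.equals(current) && !seenKeys.add(key)) {
    throw new IOException("Key reappeared out of order: " + key);
  }
}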

What is perplexing me is that, in addition to the output files for each key, 
each output format instance opens a log file. I am seeing an exception 
propagate out of the reducer, but no corresponding error ever appears in my 
log file. Some sample code follows to clarify.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class ModifiedTextOutputFormat {

  private FileSystem fs;                  // set up elsewhere
  private FSDataOutputStream outputFile;  // file for the current key
  private FSDataOutputStream logFile;     // per-format log file
  private String current;                 // key whose file is open

  public ModifiedTextOutputFormat() {
    createLogFile();
  }

  protected void createOutputFile(String name) throws IOException {
    try {
      // Fails with an IOException if the file already exists.
      outputFile = fs.create(new Path(name));
    } catch (Throwable t) {
      // Here I log the error (although it is missing later).
      logFile.writeBytes("Information about the error");
      // Here I close that file to be certain the last line is flushed.
      closeLogFile();
      // Here I throw an exception, which appears on stderr.
      throw new IOException("Information about the error", t);
    }
  }

  public void write(Key k, Value v) throws IOException {
    // When a new key arrives, close the previous file and open a new one.
    if (!k.toString().equals(current)) {
      if (outputFile != null) {
        outputFile.close();
      }
      current = k.toString();
      createOutputFile(current);
    }
    outputFile.writeBytes(v.toString());
  }
}
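
For completeness, closeLogFile() is meant to force that last line out to HDFS 
before the task dies. A sketch of what I mean, assuming logFile is an 
FSDataOutputStream (sync() is the 0.20-era call that later releases renamed to 
hflush()):

// Sketch of closeLogFile(), assuming logFile is an FSDataOutputStream.
private void closeLogFile() throws IOException {
  logFile.flush();  // push bytes out of the client-side buffer
  logFile.sync();   // persist to the datanodes (0.20-era name for hflush())
  logFile.close();
}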
