On Sep 6, 2008, at 9:35 AM, Ryan LeCompte wrote:

I have a question regarding multiple output files that get produced as
a result of using multiple reduce tasks for a job (as opposed to only
one). If I'm using a custom writable and thus writing to a sequence
output, am I guaranteed that all of the data for a particular key will
appear in a single output file (e.g., part-0000), or is it possible
that the values could be split across multiple part-xxxx files?

Each key will be processed by exactly one reduce, and the keys given to each reduce are sorted. The application can define a Partitioner that picks the reduce for each key. The default one (HashPartitioner) uses (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, which is usually balanced.

If your key had both a date and time and you wanted to have all of the transactions for a given day in the same reduce, you could do:

class MyKey {
  Date date;
  Time time;
}

and use a partitioner like:

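public class MyPartitioner extends Partitioner<MyKey, MyValue> {
  @Override
  public int getPartition(MyKey key, MyValue value,
                          int numReduceTasks) {
    // Hash only the date, so every key for a given day goes to the same reduce.
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    return (key.date.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}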

Of course the risk is that you may have very unbalanced reduce sizes, depending on your data.
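
To make the skew concrete, here is a rough plain-Java sketch of the same arithmetic with no Hadoop classes involved. The class name PartitionSketch, its two helper methods, and the sample records are all made up for illustration; dates are plain strings here rather than the Date field of MyKey. It only shows that every record for one day lands in the same partition, so a busy day means one large reduce.

```java
// Plain-Java illustration only; PartitionSketch and its methods are
// hypothetical names and are not part of Hadoop.
class PartitionSketch {

  // Same arithmetic as the partitioner above: clear the sign bit so the
  // result is never negative, then take the remainder.
  static int partitionFor(Object partitionField, int numReduceTasks) {
    return (partitionField.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

  // Counts how many records fall into each partition when only the date
  // (record[0]) is hashed. Each record is {date, time}.
  static int[] recordsPerPartition(String[][] records, int numReduceTasks) {
    int[] counts = new int[numReduceTasks];
    for (String[] record : records) {
      counts[partitionFor(record[0], numReduceTasks)]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    // Made-up sample: four transactions on one day, one on the next.
    String[][] records = {
      {"2008-09-05", "09:00"}, {"2008-09-05", "09:01"},
      {"2008-09-05", "17:30"}, {"2008-09-05", "23:59"},
      {"2008-09-06", "09:35"},
    };
    int[] counts = recordsPerPartition(records, 4);
    for (int i = 0; i < counts.length; i++) {
      System.out.printf("partition %d: %d records%n", i, counts[i]);
    }
  }
}
```

With only two distinct dates, at most two of the four partitions receive any records, no matter how many transactions there are.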

At the
end of the job I'm using the sequence file reader to read each custom
key/writable pair from each output file. Is it possible that the same
key could appear in multiple output files?

No.

-- Owen
