Re: Multiple output files

Owen O'Malley Sat, 06 Sep 2008 11:44:56 -0700


On Sep 6, 2008, at 9:35 AM, Ryan LeCompte wrote:

I have a question regarding multiple output files that get produced as
a result of using multiple reduce tasks for a job (as opposed to only
one). If I'm using a custom writable and thus writing to a sequence
output, am I gauranteed that all of the day for a particular key will
appear in a single output file (e.g., part-0000), or is it possible
that the values could be split across multiple part-xxxx files?

Each key will be processed by exactly one reduce. All of the keys toeach reduce will be sorted. The application can define a Partitionerthat picks the reduce for each key. The default one useskey.hashCode() % numReduces, which is usually balanced. If your keyhad both a date and time and you wanted to have all of thetransactions for a given day in the same reduce, you could do:


class MyKey {
  Date date;
  Time time;
}

and use a partitioner like:

public class MyPartitioner extends Partitioner {
  public int getPartition(MyKey key, MyValue value,
                          int numReduceTasks) {
    return (key.date.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

Of course the risk is that you may have very unbalanced reduce sizes,depending on your data.

At the
end of the job I'm using the sequence file reader to read each custom
key/writable pair from each output file. Is it possible that the same
key could appear in multiple output files?


No.

-- Owen

Re: Multiple output files

Reply via email to