OutputFormat Theory Question

Berry, Matt Thu, 19 Jul 2012 09:25:27 -0700

From what I gather about how MapReduce operates, there isn't really any
functional difference between a single OutputFormat object being
initialized on a central node and each reducer task initializing its own
OutputFormat object. What I would like to know, however, is the
relationship between the records that are passed to the OutputFormat from
the reducers. Take the case of a sorting MapReduce job, where the mapper
and reducer are both identity functions. In this setup, I would expect the
records passed to the OutputFormat from the reducer to be sorted and to
arrive in order.
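
To make that expectation concrete, here is a minimal sketch in plain Java
(no Hadoop classes involved). It assumes keys are assigned to reducers by a
hash of the key and that each reducer's share is sorted before the identity
reducer sees it. The class and method names (SortedPartitionsSketch,
partitionFor, shuffleAndSort) are made up for illustration. Under these
assumptions each reducer receives its own records in order, but nothing is
implied about the ordering of records across different reducers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedPartitionsSketch {

    // Hypothetical stand-in for a hash-based partitioner: picks a reducer for a key.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Splits the keys across reducers, then sorts each reducer's share,
    // imitating the shuffle-and-sort step that precedes an identity reducer.
    static List<List<String>> shuffleAndSort(List<String> keys, int numReducers) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, numReducers)).add(key);
        }
        for (List<String> partition : partitions) {
            Collections.sort(partition);
        }
        return partitions;
    }

    // Returns true if the list is in non-decreasing order.
    static boolean isSorted(List<String> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (keys.get(i - 1).compareTo(keys.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Bob", "Alice", "Carol", "Adam", "Bea", "Cyd");
        List<List<String>> partitions = shuffleAndSort(keys, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("reducer " + i + " -> " + partitions.get(i)
                    + " sorted=" + isSorted(partitions.get(i)));
        }
    }
}
```

Which keys land on which reducer depends on the hash, so with this scheme
records starting with the same letter may be spread over several reducers.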


A simplified version of my use case is to sort a large number of records and
then write all the ones that start with A to a file named A, those that start
with B to a file named B, etc. Because each file can only be opened for
writing once, it is very important in this use case to know whether the
records arrive at the OutputFormat in order, so that I know it is safe to
close file A when I encounter a record that belongs in B.
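
Roughly, the writer logic I have in mind looks like the sketch below. It is
plain Java rather than a real OutputFormat or RecordWriter, the "files" are
kept in memory so the example is self-contained, and the names
(RollingWriterSketch, write, close, contents) are hypothetical. It rolls to
a new file whenever the first letter changes and fails loudly if a record
arrives for a file that has already been closed, which is exactly the case
that in-order arrival would rule out.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class RollingWriterSketch {
    // In-memory stand-in for the output files: file name -> contents.
    private final Map<String, StringBuilder> files = new LinkedHashMap<>();
    // Names of files that were closed and must not be opened again.
    private final Set<String> closed = new HashSet<>();
    // Name of the file currently open for writing, or null if none is open.
    private String current;

    // Writes one record, rolling to a new file when the first letter changes.
    public void write(String record) {
        if (record.isEmpty()) {
            throw new IllegalArgumentException("empty record");
        }
        String target = record.substring(0, 1);
        if (!target.equals(current)) {
            close(); // safe only if no more records for the old file will arrive
            if (closed.contains(target)) {
                throw new IllegalStateException("out-of-order record, file "
                        + target + " was already closed: " + record);
            }
            files.put(target, new StringBuilder());
            current = target;
        }
        files.get(current).append(record).append('\n');
    }

    // Closes the currently open file, if any.
    public void close() {
        if (current != null) {
            closed.add(current);
            current = null;
        }
    }

    // Returns a snapshot of every file's contents, keyed by file name.
    public Map<String, String> contents() {
        Map<String, String> snapshot = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : files.entrySet()) {
            snapshot.put(e.getKey(), e.getValue().toString());
        }
        return snapshot;
    }

    public static void main(String[] args) {
        RollingWriterSketch writer = new RollingWriterSketch();
        for (String record : new String[] {"Adam", "Alice", "Bea", "Bob"}) {
            writer.write(record);
        }
        writer.close();
        System.out.println(writer.contents());
    }
}
```

If the records do not arrive in order, for example "Adam", "Bob", "Alice",
the third write throws, because file A was closed when "Bob" was seen.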

Sincerely,
Matthew Berry
