question about org.apache.hadoop.mapred.join

Koert Kuipers Tue, 10 Apr 2012 08:11:39 -0700

I read about CompositeInputFormat and how it allows one to join two
datasets together, as long as those datasets are sorted and partitioned the
same way.

OK, I think I get it, but something bothers me. It is suggested that two
datasets are "sorted and partitioned the same way" if both are outputs of
MapReduce jobs that used the same number of reducers and the same sorting
and partitioning. However, something like CompositeInputFormat depends on
the splits lining up, and sending two datasets through the same reducer
setup doesn't guarantee that at all. Splits, after all, are based on things
like data size in MB, and the reducers have no control over whether that
will match. part-00007 for dataset 1 could be a different size (and have a
different number of splits) than part-00007 for dataset 2, even if the two
files cover the same keys and are sorted the same way. In that case
CompositeInputFormat would not work. Is this correct?
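
To make the concern concrete, here is a rough sketch of the arithmetic I
have in mind. It assumes splits are cut purely by a fixed byte size, which
is a simplification and not Hadoop's actual split logic. The helper
count_splits, the 64 MB split size, and the two file sizes are all made up
for illustration.

```python
def count_splits(file_size_bytes, split_size_bytes):
    """Number of splits under a simplified, size-only model.

    Hypothetical helper: it just takes the ceiling of size / split size.
    """
    if file_size_bytes == 0:
        return 0
    # Integer ceiling division, avoids floating-point rounding.
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes


MB = 1024 * 1024
SPLIT_SIZE = 64 * MB  # assumed split size, for illustration only

# Made-up sizes for the same partition file in the two datasets.
part_00007_dataset1 = 100 * MB
part_00007_dataset2 = 200 * MB

splits_1 = count_splits(part_00007_dataset1, SPLIT_SIZE)
splits_2 = count_splits(part_00007_dataset2, SPLIT_SIZE)

# Same keys and same sort order, yet the split counts differ.
print("dataset 1 part-00007 splits:", splits_1)
print("dataset 2 part-00007 splits:", splits_2)
```

Under this model the two part-00007 files end up with different numbers of
splits, so the i-th split of one file would not cover the same key range as
the i-th split of the other.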
