Your understanding is correct. The framework doesn't do anything to align input splits across datasets. In the situation you describe- where one can't seek among key groups in the input data- it often makes sense to disable splitting of the individual files by setting the min split size to Integer.MAX_VALUE.
The description probably shouldn't use "partitioned", since that implies that the partitioner is sufficient. -C On Tue, Apr 10, 2012 at 8:11 AM, Koert Kuipers <ko...@tresata.com> wrote: > I read about CompositeInputFormat and how it allows one to join two datasets > together as long as those datasets were sorted and partitioned the same way. > Ok i think i get it, but something bothers me. It is suggested that two > datasets are "sorted and partitioned the same way" if they were both outputs > from the mapreduce process with the same number of reducers with the same > sorting & partitioning. However, something like CompositeInputFormat depends > on the splits lining up, and two datasets going through the same reducer > setup doesn't guarantee that at all. Splits after all are based on stuff > like data size in MBs, and the reducers do not control that this will be the > same. part-00007 for dataset 1 could be a different size (and have different > number of splits) than part-00007 for dataset 2, even if they have the same > keys and are sorted the same way. So now CompositeInputFormat would not > work. Is this correct?