Your understanding is correct. The framework doesn't do anything to
align input splits across datasets. In the situation you describe-
where one can't seek among key groups in the input data- it often
makes sense to disable splitting of the individual files by setting
the min split size to Integer.MAX_VALUE.

The description probably shouldn't use "partitioned", since that
implies that the partitioner is sufficient. -C

On Tue, Apr 10, 2012 at 8:11 AM, Koert Kuipers <ko...@tresata.com> wrote:
> I read about CompositeInputFormat and how it allows one to join two datasets
> together as long as those datasets were sorted and partitioned the same way.
> Ok i think i get it, but something bothers me. It is suggested that two
> datasets are "sorted and partitioned the same way" if they were both outputs
> from the mapreduce process with the same number of reducers with the same
> sorting & partitioning. However, something like CompositeInputFormat depends
> on the splits lining up, and two datasets going through the same reducer
> setup doesn't guarantee that at all. Splits after all are based on stuff
> like data size in MBs, and the reducers do not control that this will be the
> same. part-00007 for dataset 1 could be a different size (and have different
> number of splits) than part-00007 for dataset 2, even if they have the same
> keys and are sorted the same way. So now CompositeInputFormat would not
> work. Is this correct?

Reply via email to