thanks for that answer. makes sense. koert

On Tue, Apr 10, 2012 at 1:33 PM, Chris Douglas <cdoug...@apache.org> wrote:

> Your understanding is correct. The framework doesn't do anything to
> align input splits across datasets. In the situation you describe-
> where one can't seek among key groups in the input data- it often
> makes sense to disable splitting of the individual files by setting
> the min split size to Integer.MAX_VALUE.
>
> The description probably shouldn't use "partitioned", since that
> implies that the partitioner is sufficient. -C
>
> On Tue, Apr 10, 2012 at 8:11 AM, Koert Kuipers <ko...@tresata.com> wrote:
> > I read about CompositeInputFormat and how it allows one to join two
> datasets
> > together as long as those datasets were sorted and partitioned the same
> way.
> > Ok i think i get it, but something bothers me. It is suggested that two
> > datasets are "sorted and partitioned the same way" if they were both
> outputs
> > from the mapreduce process with the same number of reducers with the same
> > sorting & partitioning. However, something like CompositeInputFormat
> depends
> > on the splits lining up, and two datasets going through the same reducer
> > setup doesn't guarantee that at all. Splits after all are based on stuff
> > like data size in MBs, and the reducers do not control that this will be
> the
> > same. part-00007 for dataset 1 could be a different size (and have
> different
> > number of splits) than part-00007 for dataset 2, even if they have the
> same
> > keys and are sorted the same way. So now CompositeInputFormat would not
> > work. Is this correct?
>

Reply via email to