thanks for that answer. makes sense.
koert

On Tue, Apr 10, 2012 at 1:33 PM, Chris Douglas <cdoug...@apache.org> wrote:
> Your understanding is correct. The framework doesn't do anything to
> align input splits across datasets. In the situation you describe-
> where one can't seek among key groups in the input data- it often
> makes sense to disable splitting of the individual files by setting
> the min split size to Integer.MAX_VALUE.
>
> The description probably shouldn't use "partitioned", since that
> implies that the partitioner is sufficient. -C
>
> On Tue, Apr 10, 2012 at 8:11 AM, Koert Kuipers <ko...@tresata.com> wrote:
> > I read about CompositeInputFormat and how it allows one to join two
> > datasets together as long as those datasets were sorted and partitioned
> > the same way.
> > Ok i think i get it, but something bothers me. It is suggested that two
> > datasets are "sorted and partitioned the same way" if they were both
> > outputs from the mapreduce process with the same number of reducers with
> > the same sorting & partitioning. However, something like
> > CompositeInputFormat depends on the splits lining up, and two datasets
> > going through the same reducer setup doesn't guarantee that at all.
> > Splits after all are based on stuff like data size in MBs, and the
> > reducers do not control that this will be the same. part-00007 for
> > dataset 1 could be a different size (and have different number of
> > splits) than part-00007 for dataset 2, even if they have the same keys
> > and are sorted the same way. So now CompositeInputFormat would not
> > work. Is this correct?
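
For anyone finding this thread later, here is a rough sketch of why the splits fail to line up and why raising the min split size helps. It is a simplified stand-alone model, not Hadoop code: the class name, the file sizes, and the 64 MB block size are made up for illustration, and it ignores the small slop allowance Hadoop gives the last split. The split-size rule is the max(minSize, min(maxSize, blockSize)) form used by FileInputFormat, as far as I know.

```java
// Simplified, stand-alone model of size-based input splitting.
// Hypothetical names and sizes; performs no I/O and uses no Hadoop classes.
class SplitAlignmentSketch {

    // Split size rule: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits a file of fileLength bytes is cut into (ceiling division).
    static long countSplits(long fileLength, long splitSize) {
        if (fileLength == 0) {
            return 0;
        }
        return fileLength / splitSize + (fileLength % splitSize == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;          // assumed block size
        long part7Dataset1 = 150 * mb;     // assumed size of part-00007 in dataset 1
        long part7Dataset2 = 100 * mb;     // assumed size of part-00007 in dataset 2

        // Default-like settings: the split size falls back to the block size,
        // so the two files are cut into different numbers of splits.
        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println("default, dataset 1: " + countSplits(part7Dataset1, defaultSplit));
        System.out.println("default, dataset 2: " + countSplits(part7Dataset2, defaultSplit));

        // Min split size raised to Integer.MAX_VALUE, as suggested in the thread:
        // each file smaller than that becomes a single split.
        long wholeFileSplit = computeSplitSize(blockSize, Integer.MAX_VALUE, Long.MAX_VALUE);
        System.out.println("min=MAX, dataset 1: " + countSplits(part7Dataset1, wholeFileSplit));
        System.out.println("min=MAX, dataset 2: " + countSplits(part7Dataset2, wholeFileSplit));
    }
}
```

With one split per file, split i of dataset 1 and split i of dataset 2 both cover the whole of the same part file, which is the alignment the join needs. Under this model a file larger than Integer.MAX_VALUE bytes (about 2 GiB) would still be cut into more than one split, so that caveat is worth checking against your own data.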