jason hadoop wrote:
This is discussed in chapter 8 of my book.
What book? Is it out?

In short,
If both data sets are:

   - in the same key order
   - partitioned with the same partitioner,
   - read with the same input format (necessary for this
   simple example only)

A map-side join will present all the key/value pairs of each partition to a
single map task, in key order. Given:
Path dir1 == the directory containing the part-XXXXX files for data set 1
Path dir2 == the directory containing the part-XXXXX files for data set 2
use CompositeInputFormat.compose to build the join statement.

Set the InputFormat to CompositeInputFormat:
conf.setInputFormat(CompositeInputFormat.class);

String joinStatement = CompositeInputFormat.compose("inner", dir1, dir2);
conf.set("mapred.join.expr", joinStatement);
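
Putting the above together, a minimal driver sketch (old org.apache.hadoop.mapred API) might look as follows. The class names MapSideJoinDriver and JoinMapper, the argument/output-path handling, and the choice of KeyValueTextInputFormat for both data sets are assumptions for illustration; in the Hadoop versions I'm familiar with, the compose overload that takes an operation string also expects the shared input format class as its second argument.

// Minimal map-side join driver sketch (old org.apache.hadoop.mapred API).
// MapSideJoinDriver, JoinMapper and KeyValueTextInputFormat are illustrative choices.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapSideJoinDriver.class);

    Path dir1 = new Path(args[0]); // part-XXXXX files for data set 1
    Path dir2 = new Path(args[1]); // part-XXXXX files for data set 2

    // Read both data sets through the composite (join) input format.
    conf.setInputFormat(CompositeInputFormat.class);

    // Both data sets must be readable by the same underlying input format.
    String joinStatement = CompositeInputFormat.compose(
        "inner", KeyValueTextInputFormat.class, dir1, dir2);
    conf.set("mapred.join.expr", joinStatement);

    conf.setMapperClass(JoinMapper.class); // mapper sketched further below
    conf.setNumReduceTasks(0);             // pure map-side join, no reduce step
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));

    JobClient.runJob(conf);
  }
}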

The value class for your map method will be TupleWritable.
In the map method,

   - value.has(x) indicates if the Xth ordinal data set has a value for this
   key
   - value.get(x) returns the value from the Xth ordinal data set for this
   key
   - value.size() returns the number of data sets in the join

In our example, dir1 would be ordinal 0, and dir2 would be ordinal 1.
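
To make that concrete, here is a matching mapper sketch (again old mapred API); the name JoinMapper and the Text key/value types are assumptions that follow from using KeyValueTextInputFormat in the driver sketch above.

// Mapper sketch for the join above; with an "inner" join both ordinals are
// present for every key, so the has() checks mainly matter for "outer" joins.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.join.TupleWritable;

public class JoinMapper extends MapReduceBase
    implements Mapper<Text, TupleWritable, Text, Text> {

  public void map(Text key, TupleWritable value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Ordinal 0 is dir1 (first path given to compose), ordinal 1 is dir2.
    if (value.has(0) && value.has(1)) {
      Text left = (Text) value.get(0);
      Text right = (Text) value.get(1);
      output.collect(key, new Text(left + "\t" + right));
    }
  }
}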
The partitioner is normally used for the reduce step, but here it will already be used at the map stage?

Basically my files look like:
id<tab>matrix
id2<tab>anothermatrix
and
id<tab>vector1
id<tab>vector2
id2<tab>vector3

id is just an integer and there is only one matrix but many vectors tied to the same id.
I just want the values from both files that have the same id.
Do I need a partitioner in this case? What happens if the file is split into blocks such that two blocks
contain entries with the same key?

Am I right that, using the example above, the mapper will be called three times with:
key=id   tuple=(matrix,vector1)
key=id   tuple=(matrix,vector2)
key=id2 tuple=(anothermatrix,vector3)

cheers,
Christian
