Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 

The following page has been changed by OlgaN:

New page:
= Join Framework =

== Objective == 

This document provides a comprehensive view of performing joins in Pig. By 
=JOIN= here we mean traditional inner/outer =SQL= joins which in Pig are 
realized via =COGROUP= followed by =flatten= of the relations.

Some of the approaches described in this document can also be applied to 
=CROSS= and =GROUP= as well. 

== Joins ==

Currently, Pig running on top of Hadoop executes all joins in the same way. 
During the map stage, the data from each relation is annotated with the index 
of that relation. Then, the data is sorted and partitioned by the join key and 
provided to the reducer. This is similar to SQL's =hash join=. In the next 
generation Pig (currently on types branch), the data from the same relation is 
guaranteed to be continuous for the same key. This is to allow optimization 
that only keep =N-1= relations in memory. (Unfortunately, we did not see the 
expected speedup when this optimization was tried - investigation is still in 

In some situations, more efficient join implementations can be constructed if 
more is known about the data of the relations. They are described in the 

=== Pre-partitioned Join (PPJ) === 

This join type takes advantage of the fact that the data of all relations is 
already partition by the join key or its prefix which means that the join can 
be done completely independently on separate nodes. It further helps if the 
data is sorted on the key; otherwise it might have to get sorted before the 

In the case of =Hadoop=, this means that the join can be done in a =Map= 
avoiding =SORT/SHUFFLE/REDUCE= stages. The performance would be even better if 
the partitions for the same key ranges were collocated on the same nodes and if 
the computation was scheduled to run on this nodes. However, for now this is 
outside of Pig's control.

Note that GROUP can take advantage of this knowledge as well.

[Discussion of different data layout options.]

=== Fragment Replicate Join (FRJ) === 

This join type takes advantage of the fact that N-1 relations in the join are 
very small and can fit into main memory of each node. In this case, the small 
tables can be copied onto all the nodes and be joined with the data from the 
larger table. This saves the cost of sorting and partitioning the large table. 
For Hadoop this means that the join can happen on the map side. 

The data coming out of the join is not guaranteed to be sorted on the join key 
which could cause problems for queries that follow join by =GROUP= or =ORDER 
BY= on the prefix of the join key. This should be taken into account when 
choosing join type.

If you have several larger tables in the join that can't fit into memory, it 
might be beneficial to split the join to fit FRJ pattern since it would 
significantly reduce the size of the data going into the next join and might 
even allow to use FRJ again.

Note that CROSS can take advantage of this approach as well.

=== Indexed Join (IJ) ===

This join type takes advantage of the fact that one or more tables 
participating in the join have index on the group by key or its prefix. This is 
similar in structure to FRJ join but could be even more efficient since 
processing time can be proportional to the size of the non-indexed and 
hopefully smaller table.

In Hadoop, this will also result in a map side join. 

Currently neither Pig nor Hadoop have indexing structure. So getting to this 
point might take some time and needs some compelling use cases to make the 

Reply via email to