Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/JoinFramework

------------------------------------------------------------------------------
  
  == Objective ==
  
- This document provides a comprehensive view of performing joins in Pig. By 
JOIN here we mean traditional inner/outer SQL joins which in Pig are realized 
via COGROUP followed by flatten of the relations.
+ This document provides a comprehensive view of performing joins in Pig. By `JOIN` here we mean traditional inner/outer SQL joins, which in Pig are realized via `COGROUP` followed by a `FLATTEN` of the grouped relations.
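As an illustration of these semantics, here is a small Python sketch (not Pig syntax; the relations and field names are invented) of `COGROUP` followed by flattening of both bags. It yields inner-join results because a key with an empty bag on either side produces no output:

```python
from collections import defaultdict

def cogroup(a, b, key=lambda t: t[0]):
    """Group tuples from two relations by key, like Pig's COGROUP:
    returns (key, bag_of_a_tuples, bag_of_b_tuples) triples."""
    groups = defaultdict(lambda: ([], []))
    for t in a:
        groups[key(t)][0].append(t)
    for t in b:
        groups[key(t)][1].append(t)
    return [(k, bags[0], bags[1]) for k, bags in groups.items()]

def flatten_join(cogrouped):
    """Flatten both bags: the per-key cross product. Keys with an
    empty bag on either side produce nothing -- inner-join semantics."""
    out = []
    for _, bag_a, bag_b in cogrouped:
        for ta in bag_a:
            for tb in bag_b:
                out.append(ta + tb)
    return out

users = [("alice", 30), ("bob", 25)]
clicks = [("alice", "x.com"), ("alice", "y.com"), ("carol", "z.com")]
joined = flatten_join(cogroup(users, clicks))
# "bob" and "carol" drop out: each has an empty bag on one side.
```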
  
- Some of the approaches described in this document can also be applied to 
CROSS and GROUP as well. 
+ Some of the approaches described in this document can be applied to `CROSS` and `GROUP` as well.
  
  == Joins ==
  
@@ -20, +20 @@

  
  In the case of Hadoop, this means that the join can be done in a map, avoiding the SORT/SHUFFLE/REDUCE stages. The performance would be even better if the partitions for the same key ranges were collocated on the same nodes and if the computation was scheduled to run on those nodes. However, for now this is outside of Pig's control.
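The layout idea above can be sketched in Python (the data, the join-key convention, and the partitioning function are illustrative, not part of Pig): because both relations are partitioned by the same function of the join key, each partition pair can be joined independently with no shuffle.

```python
def partition_of(key, n):
    # Both relations use the same partitioning function, so matching
    # keys always land in the same partition index.
    return hash(key) % n

def join_partition(part_a, part_b):
    # A simple in-memory hash join within a single partition.
    index = {}
    for t in part_b:
        index.setdefault(t[0], []).append(t)
    return [ta + tb for ta in part_a for tb in index.get(ta[0], [])]

def copartitioned_join(a, b, n=4):
    # Route each tuple to its partition by join key (first field here).
    parts_a = [[] for _ in range(n)]
    parts_b = [[] for _ in range(n)]
    for t in a:
        parts_a[partition_of(t[0], n)].append(t)
    for t in b:
        parts_b[partition_of(t[0], n)].append(t)
    # Each partition pair is joined independently -- on Hadoop this
    # is map-side work, with no sort/shuffle/reduce in between.
    out = []
    for pa, pb in zip(parts_a, parts_b):
        out.extend(join_partition(pa, pb))
    return out

users = [("alice", 30), ("bob", 25)]
clicks = [("alice", "x.com"), ("bob", "y.com")]
result = copartitioned_join(users, clicks)
```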
  
- Note that GROUP can take advantage of this knowledge as well.
+ Note that `GROUP` can take advantage of this knowledge as well.
  
  [Discussion of different data layout options.]
  
  === Fragment Replicate Join (FRJ) ===
  
- This join type takes advantage of the fact that N-1 relations in the join are 
very small and can fit into main memory of each node. In this case, the small 
tables can be copied onto all the nodes and be joined with the data from the 
larger table. This saves the cost of sorting and partitioning the large table. 
+ This join type takes advantage of the fact that N-1 relations in the join are small. In this case, the small tables can be copied onto all the nodes and joined with the data from the larger table. This saves the cost of sorting and partitioning the large table. The performance benefit is even greater if the small tables fit into main memory; otherwise both the small tables and the partitions of the large table need to be sorted, which is still better than having to shuffle the large table.
+ 
+ If you have several larger tables in the join that can't fit into memory, it might be beneficial to split the join to fit the FRJ pattern, since that would significantly reduce the size of the data going into the next join and might even allow FRJ to be used again.
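A minimal Python sketch of the FRJ idea, assuming the small relation fits in memory (relation names and fields are illustrative; in a real FRJ the small table is replicated to every node):

```python
def fragment_replicate_join(large_iter, small):
    # The small relation is replicated (broadcast) and indexed in
    # memory; this variant assumes it fits in memory.
    index = {}
    for t in small:
        index.setdefault(t[0], []).append(t)
    # The large relation is only streamed, fragment by fragment --
    # no sorting or partitioning of the large side is needed.
    for big in large_iter:
        for match in index.get(big[0], []):
            yield big + match

pages = [("p1", "home"), ("p2", "about")]           # small, replicated side
views = [("p1", "u1"), ("p1", "u2"), ("p3", "u3")]  # large, streamed side
out = list(fragment_replicate_join(views, pages))
```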
+ 
  For Hadoop this means that the join can happen on the map side. 
  
- The data coming out of the join is not guaranteed to be sorted on the join 
key which could cause problems for queries that follow join by GROUP or ORDER 
BY on the prefix of the join key. This should be taken into account when 
choosing join type.
+ The data coming out of the join is not guaranteed to be sorted on the join key, which could cause problems for queries that follow the join with a `GROUP` or `ORDER BY` on a prefix of the join key. This should be taken into account when choosing a join type.
- 
- If you have several larger tables in the join that can't fit into memory, it 
might be beneficial to split the join to fit FRJ pattern since it would 
significantly reduce the size of the data going into the next join and might 
even allow to use FRJ again.
  
  Note that `CROSS` can take advantage of this approach as well.
  
@@ -45, +46 @@

  
  == Metadata ==
  
- To choose best join algorithm, additional information about the data is 
required. This data can be stored with the data or in a separate repository in 
which case Pig can consume this data and make choices on user's behalf. 
However, part of Pig philosophy is to it anything which means in this case to 
operate correctly and as efficiently as possible in the absence of the 
metadata. Also, even if metadata is available user should be able to disable 
its use. 
+ To choose the best join algorithm, additional information about the data is required. This information can be stored with the data or in a separate repository, in which case Pig can consume it and make choices on the user's behalf. However, part of Pig's philosophy is to eat anything, which in this case means operating correctly, and as efficiently as possible, in the absence of metadata. Also, even if metadata is available, the user should be able to disable its use.
- 
- Questions:
- 
-  1. Should user be able to provide a conflicting information? I would think 
not?
-  2. Should user only be able to disable all optimizations or class of 
optimizations or particular optimization? What does Oracle do?
  
  === Metadata Available ===
  
  If metadata is available, Pig will pull this metadata and use it as part of optimization. The details of how this would be done are beyond the scope of this document. The required data would need to be communicated as part of Pig's requirements for the metadata repository whenever one is available.
  
+ A user can disable optimizations by using the `set` command:
+ 
+ {{{
+ set optimizations.all 'off'
+ }}}
+ 
+ We might later choose to allow only particular classes of optimizations to be disabled, for example:
+ 
+ {{{
+ set optimizations.join 'off'
+ set optimizations.reorder 'off'
+ }}}
+ 
+ 
  === No Metadata Available ===
  
