Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:

  = Join Framework =
  == Objective ==
+ x
  This document provides a comprehensive view of performing joins in Pig. By 
`JOIN` we mean traditional inner/outer `SQL` joins, which in Pig are 
realized via `COGROUP` followed by a flatten of the relations.
  Some of the approaches described in this document can also be applied to 
`CROSS` and `GROUP`.
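
The "`COGROUP` followed by flatten" formulation can be illustrated outside of Pig. The following is a minimal Python sketch, not Pig's implementation: the function names `cogroup` and `inner_join` and the sample relations are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right, key_left, key_right):
    """Group both relations by key; each key maps to (left_bag, right_bag)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[key_left(row)][0].append(row)
    for row in right:
        groups[key_right(row)][1].append(row)
    return groups

def inner_join(left, right, key_left, key_right):
    """Flatten the two bags of every group into their cross product.

    A key with an empty bag on either side contributes no rows,
    which is what gives inner-join semantics.
    """
    out = []
    groups = cogroup(left, right, key_left, key_right)
    for left_bag, right_bag in groups.values():
        out.extend(l + r for l, r in product(left_bag, right_bag))
    return out

# Hypothetical sample relations, keyed on the first field.
A = [("alice", 1), ("bob", 2), ("carol", 3)]
B = [("alice", "x"), ("alice", "y"), ("dave", "z")]
joined = inner_join(A, B, lambda r: r[0], lambda r: r[0])
```

An outer join would differ only in the flatten step: an empty bag would be replaced by a single row of nulls instead of dropping the group.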
@@ -29, +29 @@

  === Fragment Replicate Join (FRJ) ===
  This join type takes advantage of the fact that N-1 relations in the join are 
small. In this case, the small tables can be copied onto all the nodes and 
joined with the data from the larger table. This saves the cost of sorting and 
partitioning the large table. The performance benefits can be even greater if 
the small tables fit into main memory; otherwise, both the small tables and the 
partition of the large table need to be sorted, which is still better than 
having to shuffle the large table.
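
For the in-memory case, the idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Pig's code: `fragment_replicate_join` is a hypothetical name, the two list slices stand in for fragments of the large table held on different nodes, and the small table is assumed to fit in memory on each of them.

```python
from collections import defaultdict

def fragment_replicate_join(large_fragment, small, key_large, key_small):
    """Join one fragment of the large table against a full copy of the small one.

    The small table is indexed once in memory; the large fragment is only
    streamed, so it is never sorted or shuffled.
    """
    index = defaultdict(list)
    for row in small:
        index[key_small(row)].append(row)
    for row in large_fragment:
        for match in index.get(key_large(row), ()):
            yield row + match

# Hypothetical data: the large table is split into two fragments,
# and every "node" receives its own copy of the small table.
large = [("alice", 10), ("bob", 20), ("alice", 30), ("erin", 40)]
small = [("alice", "NY"), ("bob", "SF")]
fragments = [large[:2], large[2:]]

first = lambda r: r[0]
result = [
    row
    for fragment in fragments
    for row in fragment_replicate_join(fragment, small, first, first)
]
```

Each fragment is processed independently, so the per-fragment outputs can simply be concatenated.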
+ There are a couple of open questions here:
+    1. How do we know that the data is small enough to use this type of join? 
If the join comes right after the load, we know the input data sizes. However, 
if the data has undergone many transformations in between, we would not be able 
to tell. At least initially, we should limit this to joins coming right after 
a load or to FRJ explicitly requested by the user.
+    2. How do we know if the data will fit into memory? One way is to try to 
load it and, if that fails, fall back to the other type. We need to 
investigate how feasible this approach is.
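
The "try to load, fall back on failure" idea from the second question might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the names `load_within_budget` and `plan_join`, the exception type, and especially the budget, which is counted in abstract cost units (one per row by default) rather than real bytes. Whether a real runtime can detect memory exhaustion this cleanly is exactly the feasibility question raised above.

```python
class TooLargeForMemory(Exception):
    """Raised when the small table exceeds the memory budget."""

def load_within_budget(rows, budget, row_cost=lambda row: 1):
    """Load rows into a list, aborting as soon as the budget is exceeded."""
    loaded, used = [], 0
    for row in rows:
        used += row_cost(row)
        if used > budget:
            raise TooLargeForMemory(f"budget of {budget} exceeded")
        loaded.append(row)
    return loaded

def plan_join(small_rows, budget):
    """Return ("in-memory", table) if the load succeeds, else ("fallback", None)."""
    try:
        return "in-memory", load_within_budget(small_rows, budget)
    except TooLargeForMemory:
        return "fallback", None

# Hypothetical inputs: one table under the budget, one over it.
small_ok = [("alice", "NY"), ("bob", "SF")]
small_big = [("user%d" % i, i) for i in range(10)]
```

The cost of a failed attempt is the work spent loading rows before the budget was hit, which is one reason to restrict the attempt to inputs whose size is already known to be modest.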
  If you have several larger tables in the join, it might be beneficial to 
split the join to fit the FRJ pattern, since this would significantly reduce the 
size of the data going into the next join and might even allow FRJ to be used 
again.
@@ -95, +100 @@

  C = JOIN A by name, B by name USING 'ordered partitioned';
- Question: how far do we want to take that? If we have one table that is 
partitioned, another that is both partitioned and ordered, and a third that is 
neither, do we want to come up with a name, or do we require metadata 
specification in this case? I think we should require metadata.
+ It is reasonable to allow the user to force the join type for simple cases 
when all tables fit the same profile, such as all partitioned and ordered. But 
if different tables have different profiles (one both partitioned and ordered, 
another only partitioned, and yet another neither), we would require users to 
provide per-table metadata for Pig to make the right choices.
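
The rule just described (one join-type hint when every table shares a profile, per-table metadata otherwise) can be sketched as follows. This Python fragment is purely illustrative: `TableProfile`, `HINTS`, and `profiles_for_join` are hypothetical names, and `'ordered partitioned'` is the only hint taken from this page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableProfile:
    partitioned: bool = False
    ordered: bool = False

# Only the 'ordered partitioned' hint appears on this page; any other
# entry added here would be an assumption.
HINTS = {"ordered partitioned": TableProfile(partitioned=True, ordered=True)}

def profiles_for_join(tables, hint=None, metadata=None):
    """Resolve one profile per table.

    Per-table metadata wins and must cover every table. A join-type hint
    applies the same profile to all tables. With neither, nothing is
    assumed about any table.
    """
    if metadata is not None:
        missing = [t for t in tables if t not in metadata]
        if missing:
            raise ValueError(f"no metadata for tables: {missing}")
        return {t: metadata[t] for t in tables}
    if hint is not None:
        return {t: HINTS[hint] for t in tables}
    return {t: TableProfile() for t in tables}

# Simple case: every table fits the same profile, so a hint is enough.
uniform = profiles_for_join(["A", "B"], hint="ordered partitioned")

# Mixed case: the profiles differ, so they are spelled out per table.
mixed = profiles_for_join(
    ["A", "B", "C"],
    metadata={
        "A": TableProfile(partitioned=True, ordered=True),
        "B": TableProfile(partitioned=True),
        "C": TableProfile(),
    },
)
```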
  For FRJ, we don't really need additional metadata for now. We can just work 
off input data sizes. Eventually, it would be nice to estimate the actual amount 
of data going into the join, as opposed to just input sizes, but for that we 
would need data statistics. For this particular type, explicit user 
specification via join type might be useful.
