Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by OlgaN: http://wiki.apache.org/pig/JoinFramework ------------------------------------------------------------------------------ = Join Framework = == Objective == - + x This document provides a comprehensive view of performing joins in Pig. By `JOIN` here we mean traditional inner/outer `SQL` joins which in Pig are realized via `COGROUP` followed by flatten of the relations. Some of the approaches described in this document can also be applied to `CROSS` and `GROUP` as well. @@ -29, +29 @@ === Fragment Replicate Join (FRJ) === This join type takes advantage of the fact that N-1 relations in the join are small. In this case, the small tables can be copied onto all the nodes and be joined with the data from the larger table. This saves the cost of sorting and partitioning the large table. The performance benefits can be even greater if small tables fit into main memory; otherwise, both the small tables and the partition of the large need to be sorted which is still better than having to shuffle the large table. + + There are a couple of open questions here: + + 1. How do we know that the data is small enough to use this type of join. If the join comes right after the load, we know the input data sizes. However, if data undergone many transformations in between we would not be able to tell. Seems like at least initially, we should limit this to joins coming right after load or to FRJ explicitly requested by the user. + 2. How do we know if the data will fit into memory. One way to do it is to try to load it and if it fails to fall back to the other type. We need to investigate how feasible this approach is. If you have several larger tables in the join, it might be beneficial to split the join to fit FRJ pattern since it would significantly reduce the size of the data going into the next join and might even allow to use FRJ again. @@ -95, +100 @@ C = JOIN A by name, B by name USING 'ordered partitioned'; }}} - Question: how far do we want to take that? If we have one table that is partitioned and the other one that is both partitioned and ordered and the third one that is neither - do we want to come up with name or do we require metadata specification in this case? I think we should require metadata. + It is reasonable to allow the user to force the join type for simple cases when all tables fit the same profile like all partitioned and ordered. But if different tables have different profiles like one is both partitioned and ordered, another only partitioned and yet another neither, we would force users to provide per table metadata for pig to make right choices. For FRJ, we don't really need additional metadata for now. We can just work of input data sizes. Eventually, it would be nice to estimate the actual amount of data going into join as oppose to just input sizes but for that we would need data statistics. For this particular type, explicit user specification via join type might be useful.