Hi Bharath, I am not sure what you mean. (A on a , B on b1 and B on b2 , C on c) is not valid Pig syntax.
Note that unlike an SQL join, a Pig join is based strictly on equality of one (possibly multi-valued) key. Meaning, where in sql you say: select * from a, b, c where a.id=b.id and a.id2 = c.id2 (leaving the optimizer to figure out if it wants to do ((a join b) join c), or ((b cross c) join a), or ((a join c) join b), etc) In Pig you would explicitly state the order of joins: ab = JOIN a on id, b on id; abc = JOIN ab on a::id2, c on id2; If this is what you are talking about -- yes, ab will be materialized, as the second join requires a new Map-Reduce stage (there is a new key that the whole relation needs to be partitioned on). If, however, you mean simply joining multiple relations on the same key, as described earlier -- no, nothing is materialized unless you count the regular IO that needs to happen for a standard Map-Reduce join, and any possible spills to disk required when buffers run out of memory and such. Hope this helps. Dmitriy On Wed, Feb 3, 2010 at 4:17 AM, bharath v <[email protected]> wrote: > Dimitry, > > Suppose the command is like (A on a , B on b1 and B on b2 , C on c) .. Then > it requires storing the intermediate join of AB on to disk right? > > Thanks > > On Wed, Feb 3, 2010 at 5:18 PM, Dmitriy Ryaboy <[email protected]> wrote: > >> if you explicitly join 3 or more relations with a single command ("d = >> join a on id, b on id, c on id;"), a and b will be buffered for each >> key, while c, the rightmost relation, will be streamed. >> >> This is on a per-reducer basis. There is of course a whole lot of IO >> going on for getting from the Mappers to Reducers, but none of it is >> the intermediate result of joining A to B. >> >> -Dmitriy >> >> On Tue, Feb 2, 2010 at 10:52 PM, bharath v >> <[email protected]> wrote: >> > Hi , >> > >> > I have a small doubt in how pig handles queries containing join of more >> than >> > 2 tables . >> > >> > Suppose we have 3 tables A,B,C .. and the plan is "((AB)C)" .. >> > We can join A,B in a map reduce job and join the resultant table with >> "C". I >> > have a doubt whether the result of "AB" is stored to disk before joining >> > with C or is it streamed directly to join with C (I dont know how , just >> a >> > guess) . >> > >> > Any help is appreciated , >> > >> > Thanks >> > >> >
