Dimitry , Thanks for your reply . That was what I wanted .. I am new to pig , so i couldn't express it . Got it .
Thanks On Wed, Feb 3, 2010 at 8:23 PM, Dmitriy Ryaboy <[email protected]> wrote: > Hi Bharath, > I am not sure what you mean. (A on a , B on b1 and B on b2 , C on c) > is not valid Pig syntax. > > Note that unlike an SQL join, a Pig join is based strictly on equality > of one (possibly multi-valued) key. > > Meaning, where in sql you say: > > select * from a, b, c where a.id=b.id and a.id2 = c.id2 > > (leaving the optimizer to figure out if it wants to do ((a join b) > join c), or ((b cross c) join a), or ((a join c) join b), etc) > > In Pig you would explicitly state the order of joins: > > ab = JOIN a on id, b on id; > abc = JOIN ab on a::id2, c on id2; > > If this is what you are talking about -- yes, ab will be materialized, > as the second join requires a new Map-Reduce stage (there is a new key > that the whole relation needs to be partitioned on). > > If, however, you mean simply joining multiple relations on the same > key, as described earlier -- no, nothing is materialized unless you > count the regular IO that needs to happen for a standard Map-Reduce > join, and any possible spills to disk required when buffers run out of > memory and such. > > Hope this helps. > > Dmitriy > > On Wed, Feb 3, 2010 at 4:17 AM, bharath v > <[email protected]> wrote: > > Dimitry, > > > > Suppose the command is like (A on a , B on b1 and B on b2 , C on c) .. > Then > > it requires storing the intermediate join of AB on to disk right? > > > > Thanks > > > > On Wed, Feb 3, 2010 at 5:18 PM, Dmitriy Ryaboy <[email protected]> > wrote: > > > >> if you explicitly join 3 or more relations with a single command ("d = > >> join a on id, b on id, c on id;"), a and b will be buffered for each > >> key, while c, the rightmost relation, will be streamed. > >> > >> This is on a per-reducer basis. There is of course a whole lot of IO > >> going on for getting from the Mappers to Reducers, but none of it is > >> the intermediate result of joining A to B. > >> > >> -Dmitriy > >> > >> On Tue, Feb 2, 2010 at 10:52 PM, bharath v > >> <[email protected]> wrote: > >> > Hi , > >> > > >> > I have a small doubt in how pig handles queries containing join of > more > >> than > >> > 2 tables . > >> > > >> > Suppose we have 3 tables A,B,C .. and the plan is "((AB)C)" .. > >> > We can join A,B in a map reduce job and join the resultant table with > >> "C". I > >> > have a doubt whether the result of "AB" is stored to disk before > joining > >> > with C or is it streamed directly to join with C (I dont know how , > just > >> a > >> > guess) . > >> > > >> > Any help is appreciated , > >> > > >> > Thanks > >> > > >> > > >
