Re: Is Intermediate data written to disk?

bharath v Wed, 03 Feb 2010 20:15:15 -0800

Dmitriy,

Thanks for your reply. That was what I wanted to know. I am new to Pig, so I
couldn't express it clearly. Got it.


Thanks



On Wed, Feb 3, 2010 at 8:23 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Hi Bharath,
> I am not sure what you mean. (A on a , B on b1 and B on b2 , C on c)
> is not valid Pig syntax.
>
> Note that unlike an SQL join, a Pig join is based strictly on equality
> of one (possibly multi-valued) key.
>
> Meaning, where in SQL you say:
>
> select * from a, b, c where a.id=b.id and a.id2 = c.id2
>
> (leaving the optimizer to figure out if it wants to do ((a join b)
> join c), or ((b cross c) join a), or ((a join c) join b), etc)
>
> In Pig you would explicitly state the order of joins:
>
> ab = JOIN a BY id, b BY id;
> abc = JOIN ab BY a::id2, c BY id2;
>
> If this is what you are talking about -- yes, ab will be materialized,
> as the second join requires a new Map-Reduce stage (there is a new key
> that the whole relation needs to be partitioned on).
>
> If, however, you mean simply joining multiple relations on the same
> key, as described earlier -- no, nothing is materialized unless you
> count the regular IO that needs to happen for a standard Map-Reduce
> join, and any possible spills to disk required when buffers run out of
> memory and such.
>
> Hope this helps.
>
> Dmitriy
>
> On Wed, Feb 3, 2010 at 4:17 AM, bharath v
> <[email protected]> wrote:
> > Dmitriy,
> >
> > Suppose the command is like (A on a, B on b1 and B on b2, C on c).
> > Then it requires storing the intermediate join of AB to disk, right?
> >
> > Thanks
> >
> > On Wed, Feb 3, 2010 at 5:18 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >
> >> If you explicitly join 3 or more relations with a single command ("d =
> >> JOIN a BY id, b BY id, c BY id;"), a and b will be buffered for each
> >> key, while c, the rightmost relation, will be streamed.
> >>
> >> This is on a per-reducer basis. There is of course a whole lot of IO
> >> going on for getting from the Mappers to Reducers, but none of it is
> >> the intermediate result of joining A to B.
> >>
> >> -Dmitriy
> >>
> >> On Tue, Feb 2, 2010 at 10:52 PM, bharath v
> >> <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > I have a small doubt about how Pig handles queries containing a join of
> >> > more than 2 tables.
> >> >
> >> > Suppose we have 3 tables A, B, C, and the plan is "((AB)C)".
> >> > We can join A and B in a map-reduce job and join the resultant table
> >> > with "C". I have a doubt whether the result of "AB" is stored to disk
> >> > before joining with C, or whether it is streamed directly to join with C
> >> > (I don't know how, just a guess).
> >> >
> >> > Any help is appreciated,
> >> >
> >> > Thanks
> >> >
> >>
> >
>
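
For readers of the archive, the per-reducer behaviour Dmitriy describes for a single multi-way join (buffer a and b for each key, stream c, the rightmost relation) can be sketched roughly as follows. This is a minimal illustration in Python, not Pig's actual implementation: the function name reduce_side_join, the dict-based rows, and the sample data are all assumptions made up for this example, and for simplicity it indexes the buffered relations in full rather than one key at a time as a reducer working on sorted input would.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(buffered_relations, streamed_relation, key):
    """Inner-join N relations on one key: buffer all but the last, stream the last.

    Hypothetical helper for illustration only; not part of Pig or Hadoop.
    """
    # Build one in-memory index per buffered relation: key value -> list of rows.
    indexes = []
    for relation in buffered_relations:
        index = defaultdict(list)
        for row in relation:
            index[row[key]].append(row)
        indexes.append(index)

    # Stream the rightmost relation one row at a time; it is never held in full.
    for row in streamed_relation:
        matches = [index.get(row[key], []) for index in indexes]
        # product() yields nothing when any buffered relation lacks this key,
        # which gives inner-join semantics.
        for combo in product(*matches):
            yield combo + (row,)


# Made-up sample relations sharing the join key "id".
a = [{"id": 1, "x": "a1"}, {"id": 2, "x": "a2"}]
b = [{"id": 1, "y": "b1"}]
c = [{"id": 1, "z": "c1"}, {"id": 1, "z": "c2"}, {"id": 3, "z": "c3"}]

# c is passed as an iterator to stress that it is consumed in a single pass.
rows = list(reduce_side_join([a, b], iter(c), "id"))
```

With this sample data only id 1 appears in all three relations, so rows holds two joined tuples, one per matching row of c. No joined "ab" relation is ever built as a separate result; a and b are only held in memory while c is read past them.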
