Re: Is Intermediate data written to disk?

Dmitriy Ryaboy Wed, 03 Feb 2010 06:53:37 -0800

Hi Bharath,
I am not sure what you mean. (A on a , B on b1 and B on b2 , C on c)
is not valid Pig syntax.

Note that unlike an SQL join, a Pig join is based strictly on equality
of one (possibly multi-valued) key.

Meaning, where in sql you say:

select * from a, b, c where a.id=b.id and a.id2 = c.id2

(leaving the optimizer to figure out if it wants to do ((a join b)
join c), or ((b cross c) join a), or ((a join c) join b), etc)

In Pig you would explicitly state the order of joins:

ab = JOIN a on id, b on id;
abc = JOIN ab on a::id2, c on id2;

If this is what you are talking about -- yes, ab will be materialized,
as the second join requires a new Map-Reduce stage (there is a new key
that the whole relation needs to be partitioned on).

If, however, you mean simply joining multiple relations on the same
key, as described earlier -- no, nothing is materialized unless you
count the regular IO that needs to happen for a standard Map-Reduce
join, and any possible spills to disk required when buffers run out of
memory and such.

Hope this helps.

Dmitriy

On Wed, Feb 3, 2010 at 4:17 AM, bharath v
<[email protected]> wrote:
> Dimitry,
>
> Suppose the command is like (A on a , B on b1 and B on b2 , C on c) .. Then
> it requires storing the intermediate join of AB on to disk right?
>
> Thanks
>
> On Wed, Feb 3, 2010 at 5:18 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> if you explicitly join 3 or more relations with a single command ("d =
>> join a on id, b on id, c on id;"), a and b will be buffered for each
>> key, while c, the rightmost relation, will be streamed.
>>
>> This is on a per-reducer basis. There is of course a whole lot of IO
>> going on for getting from the Mappers to Reducers, but none of it is
>> the intermediate result of joining A to B.
>>
>> -Dmitriy
>>
>> On Tue, Feb 2, 2010 at 10:52 PM, bharath v
>> <[email protected]> wrote:
>> > Hi ,
>> >
>> > I have a small doubt in how pig handles queries containing join of more
>> than
>> > 2 tables .
>> >
>> > Suppose we have 3 tables A,B,C .. and the plan is  "((AB)C)" ..
>> > We can join A,B in a map reduce job and join the resultant table with
>> "C". I
>> > have a doubt whether the result of "AB" is stored to disk before joining
>> > with C or is it streamed directly to join with C (I dont know how , just
>> a
>> > guess) .
>> >
>> > Any help is appreciated ,
>> >
>> > Thanks
>> >
>>
>

Re: Is Intermediate data written to disk?

Reply via email to