Iván de Prado
Wed, 11 Jun 2008 08:38:08 -0700
I understand your points.
I was thinking in a posible solution that is not currently supported by
Pig:
D = FOREACH C GENERATE (ARITY(A) == 0 ? '','','' : flatten(A)), (ARITY(B) ==
0 ? '','','' : flatten(B));
That is not currently supported by Pig. Have the proposal sense? Would
it be difficult to add this possibility to the BindCond?
Thanks,
Iván de Prado
www.ivanprado.es
El mié, 11-06-2008 a las 07:39 -0700, Alan Gates escribió:
> 3 foreachs and a union should get you what you want (I think that's what
> you meant instead of filter). It's a little nasty but I don't know of a
> way around it.
>
> As for why pig does this, in order to realize SQL outer join semantics,
> it is necessary to know what ought to be there for the table that
> doesn't have a match. So in your example, pig would need to know that
> both 1.txt and 2.txt have 3 fields. But pig, by default, does not have
> metadata. So it does not know.
>
> And it cannot figure it out by looking at existing rows. This wouldn't
> work in the case where one file was empty. Also, unlike SQL, pig does
> not place a constraint on your data that all tuples in a file be the
> same. So it would be legal for 1.txt to have a last line that was just
> '1 2'. That means pig can't guess what null columns it should append
> in the outer join.
>
> All that said, we are adding various ways for pig to interface with
> metadata when it is available (though not require it). Once we have
> that I think we should support SQL outer join semantics when that is
> what the user wants.
>
> Alan.
>
> Iván de Prado wrote:
> > Lets suppose 1.txt is:
> >
> > 1 1 1
> > 2 1 1
> > 3 1 1
> > 4 1 1
> > 5 1 1
> >
> > And 2.txt is:
> >
> > 4 2 2
> > 5 2 2
> > 6 2 2
> > 7 2 2
> >
> > The script:
> >
> > A = LOAD 'ivan/1.txt' USING PigStorage();
> > B = LOAD 'ivan/2.txt' USING PigStorage();
> > C = COGROUP A by $0, B by $0;
> > D = FOREACH C GENERATE flatten(A), flatten(B);
> >
> > dump C:
> >
> > (1, {(1, 1, 1)}, {})
> > (2, {(2, 1, 1)}, {})
> > (3, {(3, 1, 1)}, {})
> > (4, {(4, 1, 1)}, {(4, 2, 2)})
> > (5, {(5, 1, 1)}, {(5, 2, 2)})
> > (6, {}, {(6, 2, 2)})
> > (7, {}, {(7, 2, 2)})
> > (8, {}, {(8, 2, 2)})
> >
> > dump D;
> >
> > (4, 1, 1, 4, 2, 2)
> > (5, 1, 1, 5, 2, 2)
> >
> > But this is not the result that I expected. I would like to obtain this
> > result:
> >
> > (1, 1, 1, 1, '', '', '')
> > (2, 2, 1, 1, '', '', '')
> > (2, 3, 1, 1, '', '', '')
> > (4, 1, 1, 4, 2, 2)
> > (5, 1, 1, 5, 2, 2)
> > ('','','',6, 2, 2)
> > ('','','',7, 2, 2)
> > ('','','',8, 2, 2)
> >
> > This is the expected result if you do an outer join in SQL. How can I
> > modify the script to get this result? (apart of doing 3 FILTERs over C
> > and then a UNION)
> >
> > Thanks and regards,
> > Iván de Prado
> > www.ivanprado.es
> >
> >