Iván de Prado
Wed, 11 Jun 2008 08:42:49 -0700
I think the reason is that the flatten is a cross product. So {} crossed
with something is nothing.
Iván de Prado
www.ivanprado.es
El mié, 11-06-2008 a las 08:25 -0700, Prashanth Pappu escribió:
> But why doesn't
>
> D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
> dump D;
>
> give
>
> (1, 1, 1 )
> (2, 1, 1 )
> (3, 1, 1 )
> (4, 1, 1, 4, 2, 2)
> (5, 1, 1, 5, 2, 2)
> (6, 2, 2)
> (7, 2, 2)
>
> FLATTEN({(1,1,1)}), FLATTEN({}) = {()} ?
>
> Prashanth
>
>
>
>
> On Wed, Jun 11, 2008 at 7:39 AM, Alan Gates <[EMAIL PROTECTED]> wrote:
>
> > 3 foreachs and a union should get you what you want (I think that's what
> > you meant instead of filter). It's a little nasty but I don't know of a way
> > around it.
> >
> > As for why pig does this, in order to realize SQL outer join semantics, it
> > is necessary to know what ought to be there for the table that doesn't have
> > a match. So in your example, pig would need to know that both 1.txt and
> > 2.txt have 3 fields. But pig, by default, does not have metadata. So it
> > does not know.
> >
> > And it cannot figure it out by looking at existing rows. This wouldn't
> > work in the case where one file was empty. Also, unlike SQL, pig does not
> > place a constraint on your data that all tuples in a file be the same. So
> > it would be legal for 1.txt to have a last line that was just '1 2'. That
> > means pig can't guess what null columns it should append in the outer join.
> >
> > All that said, we are adding various ways for pig to interface with
> > metadata when it is available (though not require it). Once we have that I
> > think we should support SQL outer join semantics when that is what the user
> > wants.
> >
> > Alan.
> >
> >
> > Iván de Prado wrote:
> >
> >> Lets suppose 1.txt is:
> >>
> >> 1 1 1
> >> 2 1 1
> >> 3 1 1
> >> 4 1 1
> >> 5 1 1
> >>
> >> And 2.txt is:
> >>
> >> 4 2 2
> >> 5 2 2
> >> 6 2 2
> >> 7 2 2
> >>
> >> The script:
> >> A = LOAD 'ivan/1.txt' USING PigStorage();
> >> B = LOAD 'ivan/2.txt' USING PigStorage();
> >> C = COGROUP A by $0, B by $0;
> >> D = FOREACH C GENERATE flatten(A), flatten(B);
> >>
> >> dump C:
> >>
> >> (1, {(1, 1, 1)}, {})
> >> (2, {(2, 1, 1)}, {})
> >> (3, {(3, 1, 1)}, {})
> >> (4, {(4, 1, 1)}, {(4, 2, 2)})
> >> (5, {(5, 1, 1)}, {(5, 2, 2)})
> >> (6, {}, {(6, 2, 2)})
> >> (7, {}, {(7, 2, 2)})
> >> (8, {}, {(8, 2, 2)})
> >>
> >> dump D;
> >>
> >> (4, 1, 1, 4, 2, 2)
> >> (5, 1, 1, 5, 2, 2)
> >>
> >> But this is not the result that I expected. I would like to obtain this
> >> result:
> >>
> >> (1, 1, 1, 1, '', '', '')
> >> (2, 2, 1, 1, '', '', '')
> >> (2, 3, 1, 1, '', '', '')
> >> (4, 1, 1, 4, 2, 2)
> >> (5, 1, 1, 5, 2, 2)
> >> ('','','',6, 2, 2)
> >> ('','','',7, 2, 2)
> >> ('','','',8, 2, 2)
> >>
> >> This is the expected result if you do an outer join in SQL. How can I
> >> modify the script to get this result? (apart of doing 3 FILTERs over C
> >> and then a UNION)
> >>
> >> Thanks and regards,
> >> Iván de Prado
> >> www.ivanprado.es
> >>
> >>
> >>
> >