Prashanth Pappu
Wed, 11 Jun 2008 08:25:47 -0700
But why doesn't
D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
dump D;
give
(1, 1, 1 )
(2, 1, 1 )
(3, 1, 1 )
(4, 1, 1, 4, 2, 2)
(5, 1, 1, 5, 2, 2)
(6, 2, 2)
(7, 2, 2)
FLATTEN({(1,1,1)}), FLATTEN({}) = {()} ?
Prashanth
On Wed, Jun 11, 2008 at 7:39 AM, Alan Gates <[EMAIL PROTECTED]> wrote:
> 3 foreachs and a union should get you what you want (I think that's what
> you meant instead of filter). It's a little nasty but I don't know of a way
> around it.
>
> As for why pig does this, in order to realize SQL outer join semantics, it
> is necessary to know what ought to be there for the table that doesn't have
> a match. So in your example, pig would need to know that both 1.txt and
> 2.txt have 3 fields. But pig, by default, does not have metadata. So it
> does not know.
>
> And it cannot figure it out by looking at existing rows. This wouldn't
> work in the case where one file was empty. Also, unlike SQL, pig does not
> place a constraint on your data that all tuples in a file be the same. So
> it would be legal for 1.txt to have a last line that was just '1 2'. That
> means pig can't guess what null columns it should append in the outer join.
>
> All that said, we are adding various ways for pig to interface with
> metadata when it is available (though not require it). Once we have that I
> think we should support SQL outer join semantics when that is what the user
> wants.
>
> Alan.
>
>
> Iván de Prado wrote:
>
>> Lets suppose 1.txt is:
>>
>> 1 1 1
>> 2 1 1
>> 3 1 1
>> 4 1 1
>> 5 1 1
>>
>> And 2.txt is:
>>
>> 4 2 2
>> 5 2 2
>> 6 2 2
>> 7 2 2
>>
>> The script:
>> A = LOAD 'ivan/1.txt' USING PigStorage();
>> B = LOAD 'ivan/2.txt' USING PigStorage();
>> C = COGROUP A by $0, B by $0;
>> D = FOREACH C GENERATE flatten(A), flatten(B);
>>
>> dump C:
>>
>> (1, {(1, 1, 1)}, {})
>> (2, {(2, 1, 1)}, {})
>> (3, {(3, 1, 1)}, {})
>> (4, {(4, 1, 1)}, {(4, 2, 2)})
>> (5, {(5, 1, 1)}, {(5, 2, 2)})
>> (6, {}, {(6, 2, 2)})
>> (7, {}, {(7, 2, 2)})
>> (8, {}, {(8, 2, 2)})
>>
>> dump D;
>>
>> (4, 1, 1, 4, 2, 2)
>> (5, 1, 1, 5, 2, 2)
>>
>> But this is not the result that I expected. I would like to obtain this
>> result:
>>
>> (1, 1, 1, 1, '', '', '')
>> (2, 2, 1, 1, '', '', '')
>> (2, 3, 1, 1, '', '', '')
>> (4, 1, 1, 4, 2, 2)
>> (5, 1, 1, 5, 2, 2)
>> ('','','',6, 2, 2)
>> ('','','',7, 2, 2)
>> ('','','',8, 2, 2)
>>
>> This is the expected result if you do an outer join in SQL. How can I
>> modify the script to get this result? (apart of doing 3 FILTERs over C
>> and then a UNION)
>>
>> Thanks and regards,
>> Iván de Prado
>> www.ivanprado.es
>>
>>
>>
>