pig-user  

Re: Outer joins

Alan Gates
Wed, 11 Jun 2008 07:42:34 -0700

3 foreachs and a union should get you what you want (I think that's what you meant instead of filter). It's a little nasty but I don't know of a way around it.

As for why pig does this, in order to realize SQL outer join semantics, it is necessary to know what ought to be there for the table that doesn't have a match. So in your example, pig would need to know that both 1.txt and 2.txt have 3 fields. But pig, by default, does not have metadata. So it does not know.

And it cannot figure it out by looking at existing rows. This wouldn't work in the case where one file was empty. Also, unlike SQL, pig does not place a constraint on your data that all tuples in a file be the same. So it would be legal for 1.txt to have a last line that was just '1 2'. That means pig can't guess what null columns it should append in the outer join.

All that said, we are adding various ways for pig to interface with metadata when it is available (though not require it). Once we have that I think we should support SQL outer join semantics when that is what the user wants.

Alan.

Iván de Prado wrote:
Lets suppose 1.txt is:

1       1       1
2       1       1
3       1       1
4       1       1
5       1       1

And 2.txt is:

4       2       2
5       2       2
6       2       2
7       2       2

The script:
A = LOAD 'ivan/1.txt' USING PigStorage();
B = LOAD 'ivan/2.txt' USING PigStorage();
C = COGROUP A by $0, B by $0;
D = FOREACH C GENERATE flatten(A), flatten(B);

dump C:

(1, {(1, 1, 1)}, {})
(2, {(2, 1, 1)}, {})
(3, {(3, 1, 1)}, {})
(4, {(4, 1, 1)}, {(4, 2, 2)})
(5, {(5, 1, 1)}, {(5, 2, 2)})
(6, {}, {(6, 2, 2)})
(7, {}, {(7, 2, 2)})
(8, {}, {(8, 2, 2)})

dump D;

(4, 1, 1, 4, 2, 2)
(5, 1, 1, 5, 2, 2)

But this is not the result that I expected. I would like to obtain this
result:

(1, 1, 1, 1, '', '', '')
(2, 2, 1, 1, '', '', '')
(2, 3, 1, 1, '', '', '')
(4, 1, 1, 4, 2, 2)
(5, 1, 1, 5, 2, 2)
('','','',6, 2, 2)
('','','',7, 2, 2)
('','','',8, 2, 2)

This is the expected result if you do an outer join in SQL. How can I
modify the script to get this result?  (apart of doing 3 FILTERs over C
and then a UNION)

Thanks and regards,
Iván de Prado
www.ivanprado.es