pig-user  

Re: Outer joins

Iván de Prado
Wed, 11 Jun 2008 08:38:08 -0700

I understand your points. 

I was thinking in a posible solution that is not currently supported by
Pig:

D = FOREACH C GENERATE (ARITY(A) == 0 ? '','','' : flatten(A)), (ARITY(B) == 
0 ? '','','' : flatten(B));

That is not currently supported by Pig. Have the proposal sense? Would
it be difficult to add this possibility to the BindCond? 

Thanks, 
Iván de Prado
www.ivanprado.es 

El mié, 11-06-2008 a las 07:39 -0700, Alan Gates escribió:
> 3 foreachs and a union should get you what you want (I think that's what 
> you meant instead of filter).  It's a little nasty but I don't know of a 
> way around it.
> 
> As for why pig does this, in order to realize SQL outer join semantics, 
> it is necessary to know what ought to be there for the table that 
> doesn't have a match.  So in your example, pig would need to know that 
> both 1.txt and 2.txt have 3 fields.  But pig, by default, does not have 
> metadata.  So it does not know.
> 
> And it cannot figure it out by looking at existing rows.  This wouldn't 
> work in the case where one file was empty.  Also, unlike SQL, pig does 
> not place a constraint on your data that all tuples in a file be the 
> same.  So it would be legal for 1.txt to have a last line that was just 
> '1   2'.  That means pig can't guess what null columns it should append 
> in the outer join.
> 
> All that said, we are adding various ways for pig to interface with 
> metadata when it is available (though not require it).  Once we have 
> that I think we should support SQL outer join semantics when that is 
> what the user wants.
> 
> Alan.
> 
> Iván de Prado wrote:
> > Lets suppose 1.txt is:
> >
> > 1       1       1
> > 2       1       1
> > 3       1       1
> > 4       1       1
> > 5       1       1
> >
> > And 2.txt is:
> >
> > 4       2       2
> > 5       2       2
> > 6       2       2
> > 7       2       2
> >
> > The script: 
> >
> > A = LOAD 'ivan/1.txt' USING PigStorage();
> > B = LOAD 'ivan/2.txt' USING PigStorage();
> > C = COGROUP A by $0, B by $0;
> > D = FOREACH C GENERATE flatten(A), flatten(B);
> >
> > dump C:
> >
> > (1, {(1, 1, 1)}, {})
> > (2, {(2, 1, 1)}, {})
> > (3, {(3, 1, 1)}, {})
> > (4, {(4, 1, 1)}, {(4, 2, 2)})
> > (5, {(5, 1, 1)}, {(5, 2, 2)})
> > (6, {}, {(6, 2, 2)})
> > (7, {}, {(7, 2, 2)})
> > (8, {}, {(8, 2, 2)})
> >
> > dump D;
> >
> > (4, 1, 1, 4, 2, 2)
> > (5, 1, 1, 5, 2, 2)
> >
> > But this is not the result that I expected. I would like to obtain this
> > result:
> >
> > (1, 1, 1, 1, '', '', '')
> > (2, 2, 1, 1, '', '', '')
> > (2, 3, 1, 1, '', '', '')
> > (4, 1, 1, 4, 2, 2)
> > (5, 1, 1, 5, 2, 2)
> > ('','','',6, 2, 2)
> > ('','','',7, 2, 2)
> > ('','','',8, 2, 2)
> >
> > This is the expected result if you do an outer join in SQL. How can I
> > modify the script to get this result?  (apart of doing 3 FILTERs over C
> > and then a UNION)
> >
> > Thanks and regards,
> > Iván de Prado
> > www.ivanprado.es
> >
> >