On Jun 1, 2010, at 11:31 AM, Dave Viner wrote:

> I am having some trouble getting cogroup and flattening to work as I'd like.
> The cogroup statement looks like:
>
> cg = COGROUP A BY aid INNER, B BY bid;
>
> The cg group has rows in which the information in B may be empty (as
> expected). I'd like to output a series of rows each of which has the same
> number of columns. If the cg group has empty information for B, then it
> should output either NULL or an empty string. But, I can't seem to make it
> work.
>
>
> for_output = FOREACH cg
>     GENERATE FLATTEN(A.aid) AS aid,
>     FLATTEN(B.optional_b_col);
>
> If the cogroup cg has empty values in the B bag, then there is no
> corresponding row in for_output.
>
> How do I get the row to be added to for_output with an empty value for
> "optional_b_col"?
>
> I also tried something like:
>
> for_output = FOREACH cg
>     GENERATE FLATTEN(A.aid) AS aid,
>     (B.optional_b_col IS NOT NULL ? B.optional_b_col : '');
>
Given that something similar is in the documentation, you would expect it to work. But it doesn't. See

http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#Nulls

The second example after "Nulls and Constants" says you can do:

------
"In this example of an outer join, if the join key is missing from a table it is replaced by null."

A = LOAD 'student' AS (name: chararray, age: int, gpa: float);
B = LOAD 'votertab10k' AS (name: chararray, age: int, registration: chararray, donation: float);
C = COGROUP A BY name, B BY name;
D = FOREACH C GENERATE FLATTEN((IsEmpty(A) ? null : A)), FLATTEN((IsEmpty(B) ? null : B));
-------

But I have had trouble getting this to work as described, as an 'outer join' using COGROUP. The techniques that hc.busy mentions work, but they are clunky, and there aren't good alternatives that I know of at the moment. I'd love to hear what the "official" way to do an outer join using COGROUP is.

FLATTEN hates being one of the sides of a conditional, so you can't do the intuitive:

    (IsEmpty(B.optional_b_col) ? null : FLATTEN(B.optional_b_col))

Instead you have to put the conditional inside FLATTEN, and replace the null with a 'bag of one tuple with one null field' so that FLATTEN doesn't collapse the row and returns a null instead. With no built-in way to produce the 'bag of one tuple with one null field' (as of 0.7), this means writing your own UDF.

OUTER JOIN is sometimes an option, but not always, especially if you don't want to produce the cross product of the bags or need to do a more custom join.

> But, this gives an error when trying to dump the results:
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1050: Unsupported input type
> for BinCond: left hand side: bag; right hand side: chararray
>
>
> I imagine there must be some way to output empty strings, I just can't seem
> to figure it out.
>
> Thanks
> Dave Viner
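
To make the workaround from the reply concrete, here is a rough plain-Python model of what happens. It is not Pig code and not a Pig UDF; it only mimics, under my own assumptions, how a FOREACH with two FLATTENs crosses the bags in each cogroup row. The names flatten_row, null_if_empty and NULL_BAG are made up for this sketch, and the sample data is invented. It shows why an empty B bag makes the row vanish, and how swapping in a 'bag of one tuple with one null field' keeps the row with a null in that column.

```python
from itertools import product

def flatten_row(*bags):
    # Model of GENERATE FLATTEN(bag1), FLATTEN(bag2), ...:
    # one output row per combination of tuples, fields concatenated.
    # If any bag is empty, the product is empty and no row is emitted.
    return [sum(combo, ()) for combo in product(*bags)]

# Hypothetical stand-in for the 'bag of one tuple with one null field'.
NULL_BAG = [(None,)]

def null_if_empty(bag):
    # Models the conditional placed *inside* FLATTEN:
    # an empty bag is replaced by NULL_BAG, anything else passes through.
    return bag if bag else NULL_BAG

# Invented cogroup output: (group key, A.aid bag, B.optional_b_col bag).
cg = [
    (1, [(1,)], [("x",), ("y",)]),
    (2, [(2,)], []),  # nothing in B for aid 2
]

# Plain FLATTEN on both sides: the row for aid 2 disappears.
naive = [row for _, a, b in cg for row in flatten_row(a, b)]

# Conditional inside FLATTEN: the row for aid 2 survives with a null.
padded = [row for _, a, b in cg for row in flatten_row(a, null_if_empty(b))]

print(naive)
print(padded)
```

In real Pig 0.7 the null_if_empty step would have to be a UDF of your own that takes a bag and returns a bag, as described above; this sketch only illustrates the behaviour such a UDF is meant to produce.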
