There is a 'feature' (or a bug, though with no resolution so far) in FLATTEN
that I have run into, as have others on this list.

Pig will usually flatten nested elements through several levels rather than 
just one.  Your problem is the bag with a tuple in it: it's probably doing 
extra layers of unpacking at execution time.  One FLATTEN takes the bag of 
tuples and converts it directly into the two inner fields, so your second 
FLATTEN likely has no tuple left to unpack, which would explain the 
String-to-Tuple error.
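
If that is what is happening, a single FLATTEN per bag may be all that is 
needed.  A minimal sketch, assuming the relation 'data' is already loaded with 
the schema from the original question; the alias 'flat' is made up for 
illustration, and I have not run this against 0.6:

```pig
-- Assumes `data` has the schema quoted below:
--   {a: (a1, a2_bag: {(a21, a22)}, a3_bag: {(a3)})}
-- One FLATTEN per bag; no second FLATTEN on the result.
flat = FOREACH data GENERATE
           FLATTEN(a.a2_bag) AS (a21:chararray, a22:chararray),
           FLATTEN(a.a3_bag) AS (a3:long);

DESCRIBE flat;  -- compare this schema with what DUMP actually produces
DUMP flat;
```

Flattening two bags in the same GENERATE produces the cross product of their 
tuples, so each input row becomes n * m output rows.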

The planner and the execution engine have different ideas of what FLATTEN does 
when you nest things like that.  For example, a Bag with a Tuple and a Bag in 
it, when flattened, tends to unpack the inner tuple too.

I saw most of that behavior on 0.5 and 0.6.  I haven't tried again on 0.7.
But I think it would be wise for Pig to consider adding one operator that 
unpacks tuples and does NOT touch bags, and another that does the reverse.


On Jul 8, 2010, at 2:56 AM, Sparsh Gupta wrote:

> Hello
> 
> I am working on a dataset which has relations of the type:
> 
> data: {a: (a1: chararray,a2_bag: {a2_tuple: (a21: chararray,a22:
> chararray)}, a3_bag: {a3_tuple: (a3: long)})}
> 
> What this means is that each data row will have one 'a1' field and an
> 'a2_bag' bag which can hold n 'a2_tuple' tuples, each having 'a21' and
> 'a22' fields. It also has another bag, 'a3_bag', with m 'a3_tuple' tuples,
> each having an 'a3' field.
> 
> I want to get rid of all the bags and have all the data flattened into the
> format (of course creating multiple rows of dataNew from each row of data):
> 
> dataNew: {a21:chararray , a22: chararray, a3:long}
> 
> I tried using FLATTEN on 'a2_bag' and 'a3_bag'  to get
> temp: {a2_bag::a2_tuple(a21:chararray, a22:chararray) ,
> a3_bag::a3_tuple(a3:long)}
> 
> then I FLATTEN it again as
> temp1 = FOREACH temp GENERATE FLATTEN(a2_bag::a2_tuple) AS (a21:chararray,
> a22:chararray), FLATTEN(a3_bag::a3_tuple) AS (a3:long);
> 
> When I describe temp1, I get the desired structure, but when I try to
> execute it (dump it, say), I get an error: cannot convert String to Tuple.
> 
> Please let me know where I am going wrong (well, I am somewhere) and what
> the best way to solve this problem is.
> 
> P.S. I am using Pig 0.6 and use elephant-bird to get data out from HBase and
> use twitter's code to get protocol buffered data into pig readable format.
> 
> Thanks
> Sparsh Gupta
