This is not exactly my data, but it is a small example I saw in the Pig Latin Reference Manual 2. It may help describe what I am trying to do.

http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#deref

The LOAD statement in the example below is the same as what I use for my data. I only modified the data file to demonstrate multiple tuples in a bag.

    grunt> cat data
    {(1,1,1)}
    {(2,2,2)(3,3,3)}
    {(4,4,4)(5,5,5)(6,6,6)}

    grunt> A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
    grunt> DESCRIBE A;
    A: {B: {T: (t1: int,t2: int,t3: int)}}

    grunt> X = FOREACH A GENERATE B.T.t1;
    2010-07-31 16:09:46,659 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1028: Access to the tuple (T) of the bag is disallowed. Only access to the elements of the tuple in the bag is allowed.

So I cannot dereference as "B.T.t1"? Maybe dereference operators only work with two levels, e.g. tuple.field? However, if the reference manual gives this as an example, then how are the fields referenced for access in other relations?

This is perhaps a separate question, but here is what I have done. Maybe there is a simpler way to represent this in Pig.

1) Read SequenceFiles with a loadfunc
2) The SequenceFiles have data in the value that is an "array of fields" in Java
3) I thought that a Pig "bag of tuples" would be equivalent to a Java "array of fields"

The loadfunc "getNext" only allows returning a tuple (not a bag). So what I do in "getNext" is:

a) For each element of the Java array of fields, build a tuple that has those fields from Java
b) Add the tuples to a bag
c) Add the bag to a tuple and return that tuple from getNext

Thanks,
John Rodriguez

Can you give an example of your data, and what output you want from the Pig query? That will help me understand what you want the query to do. From the schema and query, that is not very clear to me.

-Thejas

On 7/30/10 3:10 PM, "Rodriguez, John" <[email protected]> wrote:

I have built a bag of tuples where the tuples contain fields. I am reading SequenceFiles and have written MyLoader to do this.
I created a subset of all the fields, "isValid" to make the example simpler. I am not sure how to apply a dereference operator to this?

    A = LOAD '/data/NetFlowDigests/rk/DigestMessage/part-r-00000' using MyLoader() AS (data: bag{t: tuple(isValid:int)});
    DESCRIBE A;
    A: {data: {t: (isValid: int)}}

So all the ways that I have tried to dereference have syntax errors.

    B = GROUP A BY (data.t);
    2010-07-30 21:51:29,881 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1028: Access to the tuple (t) of the bag is disallowed. Only access to the elements of the tuple in the bag is allowed.

    B = GROUP A BY (data.t.isValid);
    2010-07-30 21:54:11,157 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1028: Access to the tuple (t) of the bag is disallowed. Only access to the elements of the tuple in the bag is allowed.

    B = GROUP A BY (t.isValid);
    2010-07-30 21:55:31,475 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: t in {data: {t: (isValid: int)}}

What is the proper way to do this?

John Rodriguez
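
The wording of ERROR 1028 ("Only access to the elements of the tuple in the bag is allowed") suggests one possible reading of the attempts in this thread: a field inside a bag is projected directly from the bag alias, without naming the inner tuple, and a bag is usually flattened before one of its fields is used as a GROUP key. The Pig Latin below is only a sketch under that assumption. It reuses the schemas quoted in the thread, it has not been run against the original SequenceFiles, and the aliases D, X, F, G and C are introduced purely for illustration.

```
-- Projecting a field from a bag: name the bag, then the field (no tuple alias).
A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});
X = FOREACH A GENERATE B.t1;   -- per input row, a bag of single-field tuples

-- Grouping on a field held inside a bag: flatten the bag first, then group.
D = LOAD '/data/NetFlowDigests/rk/DigestMessage/part-r-00000' using MyLoader()
    AS (data: bag{t: tuple(isValid:int)});
F = FOREACH D GENERATE FLATTEN(data) AS isValid;   -- one row per tuple in the bag
G = GROUP F BY isValid;
C = FOREACH G GENERATE group, COUNT(F);            -- count of rows per isValid value
```

Flattening changes the shape of the relation: each tuple in a bag becomes its own row, so this only fits if grouping across all tuples is the intended result, which is the point Thejas asks about.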
