This is not exactly my data, but it is a small example I found in Pig
Latin Reference Manual 2. It may help describe what I am trying to do.

  http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#deref

 

The LOAD statement in the example below is the same as what I use for my
data. I only modified the data file to demonstrate multiple tuples in a
bag.

 

grunt> cat data

{(1,1,1)}

{(2,2,2)(3,3,3)}

{(4,4,4)(5,5,5)(6,6,6)}

grunt> A = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)});

grunt> DESCRIBE A;

A: {B: {T: (t1: int,t2: int,t3: int)}}

grunt> X = FOREACH A GENERATE B.T.t1;

2010-07-31 16:09:46,659 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1028: Access to the tuple (T) of the bag is disallowed. Only
access to the elements of the tuple in the bag is allowed.

 

So I cannot dereference as "B.T.t1"?

 

Maybe dereference operators only work across two levels, e.g.
tuple.field?

 

However, if the reference manual gives this as an example, then how are
the fields referenced for access in other relations?

 

This is perhaps a separate question but here is what I have done. Maybe
there is a simpler way to represent this in Pig.

1) Read SequenceFiles with a loadfunc

2) The SequenceFiles have data in the value that is an "array of fields"
in Java

3) I thought that a Pig "bag of tuples" would be equivalent to a Java
"array of fields"

 

The loadfunc's "getNext" method can only return a tuple (not a bag). So
what I do in "getNext" is:

a) For each element of the Java array of fields, build a tuple that has
those fields from Java

b) Add the tuples to a bag

c) Add the bag to a tuple and return that tuple from getNext
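
As a rough illustration of steps a) through c), here is a minimal Java
sketch. It uses plain java.util lists as stand-ins for Pig's Tuple and
DataBag types, and the class name, method name, and three-int record
layout are all hypothetical, so it shows only the nesting and not the
actual loader code.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the nesting built in getNext. Lists are used in
// place of Pig's Tuple and DataBag types, so this is NOT the Pig API.
class GetNextSketch {

    // Hypothetical record layout: each element of the Java array holds
    // three int fields, like (t1, t2, t3) in the manual's example.
    static List<Object> buildOuterTuple(int[][] records) {
        // b) the "bag" collects one inner tuple per array element
        List<List<Object>> bag = new ArrayList<>();
        for (int[] record : records) {
            // a) build a tuple holding this element's fields
            List<Object> inner = new ArrayList<>();
            for (int field : record) {
                inner.add(field);
            }
            bag.add(inner);
        }
        // c) wrap the bag in a one-field tuple; this is what would be returned
        List<Object> outer = new ArrayList<>();
        outer.add(bag);
        return outer;
    }

    public static void main(String[] args) {
        int[][] records = { {2, 2, 2}, {3, 3, 3} };
        // prints [[[2, 2, 2], [3, 3, 3]]]
        System.out.println(buildOuterTuple(records));
    }
}
```

The outer list has exactly one field (the bag), which matches the
single-field schema (B: bag {T: tuple(...)}) declared in the LOAD
statement.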

 

Thanks,

John Rodriguez

 

 

 

Can you give an example of your data, and what output you want from the
pig query?

That will help me understand what you want the query to do. From the
schema and query, that is not very clear to me.

-Thejas



On 7/30/10 3:10 PM, "Rodriguez, John" <[email protected]> wrote:

I have built a bag of tuples where the tuples contain fields.



I am reading SequenceFiles and have written MyLoader to do this. I
reduced the fields to a subset, just "isValid", to make the example
simpler.



I am not sure how to apply a dereference operator to this.



A = LOAD '/data/NetFlowDigests/rk/DigestMessage/part-r-00000' using
MyLoader() AS (data: bag{t: tuple(isValid:int)});

DESCRIBE A;

A: {data: {t: (isValid: int)}}



All the ways that I have tried to dereference it fail with errors.



B = GROUP A BY (data.t);

2010-07-30 21:51:29,881 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1028: Access to the tuple (t) of the bag is disallowed. Only
access to the elements of the tuple in the bag is allowed.



B = GROUP A BY (data.t.isValid);

2010-07-30 21:54:11,157 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1028: Access to the tuple (t) of the bag is disallowed. Only
access to the elements of the tuple in the bag is allowed.



B = GROUP A BY (t.isValid);

2010-07-30 21:55:31,475 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Invalid alias: t in {data: {t:
(isValid: int)}}



What is the proper way to do this?



John Rodriguez





 
