[Santhosh] Yes, a visitor is probably a cleaner way to do the translation.
The question: is the inability to return multiple projections from one production a limit of how the parser is implemented or of the tool, javacc, used for the parser?

[Santhosh] It's the design/implementation and not the tool. Compared to 1.x, the types branch does not have the equivalent of a StarSpec. As a result, we do not distinguish between project(0) and project( * ) during parse time.

Thanks,
Santhosh

-----Original Message-----
From: Alan Gates [mailto:[EMAIL PROTECTED]
Sent: Friday, October 03, 2008 10:48 AM
To: [email protected]
Subject: Re: Semantics of generate *

A thought and a question.

The thought: rather than having each individual operator do the translation, could a visitor be written that would walk the tree right after parsing and break project( * ) into project(1), project(2)... ? This visitor could be one of the validators (like the type checker). This way all of the logic for this restitching is in one place.

The question: is the inability to return multiple projections from one production a limit of how the parser is implemented or of the tool, javacc, used for the parser?

Alan.

Santhosh Srinivasan wrote:
> In the current implementation of generate * in the front end, a single
> projection operator with the star attribute set to true is created.
> During the schema computation, instead of generating the schema of the
> projection input, a tuple that contains the schema of the projection
> input is created. This results in double wrapping. An example will
> illustrate the problem.
>
> grunt> a = load 'one' using PigStorage(' ') as (field1, field2, field3);
> grunt> b = load 'two' as (field4, field5, field6);
> grunt> c = cogroup a by $0, b by $0;
> grunt> d = foreach c generate *;
> grunt> describe d;
>
> d: {c: (group: bytearray,a: {field1: bytearray,field2: bytearray,field3:
> bytearray},b: {field4: bytearray,field5: bytearray,field6: bytearray})}
>
> In the above example, the schema for operator d should have been
> identical to that of operator c. Instead, the schema of operator c is
> wrapped in a tuple and embedded within the schema of d. As a result, we
> have a couple of issues:
>
> 1. It is not intuitive to users that the schemas of c and d are not
> identical. They should be identical.
>
> grunt> e = foreach d generate group;
>
> 2008-10-02 16:06:11,335 [main] ERROR
> org.apache.pig.tools.grunt.GruntParser - java.io.IOException: Invalid
> alias: group in {c: (group: bytearray,a: {field1: bytearray,field2:
> bytearray,field3: bytearray},b: {field4: bytearray,field5:
> bytearray,field6: bytearray})}
>
> 2. As a workaround, we could flatten the contents of d and then access
> the contents of c.
>
> grunt> e = foreach d generate flatten($0);
> grunt> describe e;
>
> e: {c::group: bytearray,c::a: {field1: bytearray,field2:
> bytearray,field3: bytearray},c::b: {field4: bytearray,field5:
> bytearray,field6: bytearray}}
>
> However, we will not be able to compute the lineage of the fields of
> the relation, as demonstrated by the following example:
>
> grunt> f = foreach e generate flatten(a), flatten(b);
> grunt> g = foreach f generate field1 + 1;
> grunt> describe g;
>
> 2008-10-02 16:26:20,655 [main] WARN org.apache.pig.PigServer -
> bytearray is implicitly casted to integer under LOAdd Operator
> 2008-10-02 16:26:20,655 [main] ERROR org.apache.pig.PigServer - Problem
> resolving LOForEach schema Cannot resolve load function to use for
> casting from bytearray to integer. Found more than one load function to
> use: [org.apache.pig.builtin.PigStorage,
> org.apache.pig.builtin.BinStorage]
>
> This problem is contained in the frontend alone. In the backend, the
> double wrapping issue is resolved as part of PIG-359. In order to
> resolve this issue in the frontend, the project( * ) operator has to be
> translated into project(0), project(1), ..., project(n - 2), project(n -
> 1), where n is the number of columns in the relation.
>
> The translation of project( * ) into the multiple project operators
> cannot be performed in the parser without major modifications. Each
> relational operator that has an inner plan can perform this
> translation. In the current design, LOForEach, LOCogroup, LOSplitOutput,
> LOSort and LOFilter have inner plans.
>
> There are corner cases that need to be handled during the translation.
> If the schema of the project's input is not defined, then the schema of
> the relation or the column in the relation that contains the projection
> could become undefined.
>
> a = load 'one';
> b = load 'two';
> c = foreach a generate *, $0, $1; -- schema of c is undefined
> d = cogroup a by *, b by ($0, $1); -- schema of column named group in
> cogroup is undefined; also arity checking cannot be enforced
>
> Thoughts?
>
> Thanks,
> Santhosh
>
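
For illustration, here is a minimal Java sketch of the restitching step discussed above: each project( * ) in an inner plan's projection list is replaced by project(0) ... project(n - 1). Everything in it is hypothetical. The Project class and the expandStars method are stand-ins, not Pig's LOProject or its plan visitors, and using -1 to mean "input schema undefined" is an assumption made only for this example. Leaving the star unexpanded in that case is one possible way to handle the corner case, not something the thread settles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; NOT Pig's LOProject or plan classes.
final class StarExpansionSketch {

    /** A projection of one column, or of every column when star is true. */
    static final class Project {
        final boolean star;
        final int column; // ignored when star is true

        Project(boolean star, int column) {
            this.star = star;
            this.column = column;
        }

        static Project star() { return new Project(true, -1); }

        static Project col(int column) { return new Project(false, column); }

        @Override
        public String toString() {
            return star ? "project(*)" : "project(" + column + ")";
        }
    }

    /**
     * Rewrites every project(*) into project(0) .. project(n - 1).
     * inputArity is n, or -1 (an assumption of this sketch) when the
     * input schema is undefined and the star cannot be expanded.
     */
    static List<Project> expandStars(List<Project> projections, int inputArity) {
        List<Project> out = new ArrayList<>();
        for (Project p : projections) {
            if (p.star && inputArity >= 0) {
                for (int i = 0; i < inputArity; i++) {
                    out.add(Project.col(i));
                }
            } else {
                // Explicit column, or a star over an undefined schema: keep as is.
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Mirrors "generate *, $0, $1" from the corner-case example.
        List<Project> generate =
                Arrays.asList(Project.star(), Project.col(0), Project.col(1));
        System.out.println(expandStars(generate, 3));  // input has 3 known columns
        System.out.println(expandStars(generate, -1)); // input schema undefined
    }
}
```

In a visitor-based design, a method like expandStars would be applied once to the inner plan of each operator the visitor reaches, which keeps the logic in one place instead of repeating it in each operator.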
