I completely agree. It does start getting confusing, especially if we try to deal with multi-field keys.

A = load 'somefile1' USING PigStorage() AS (B, C, Z);
B = load 'somefile2' USING PigStorage() AS (A, C, Y);
C = load 'somefile3' USING PigStorage() AS (A, B);

G1 = COGROUP A by (B, C), B by (A, C);
G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);

What is the schema for G2?

ben

On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
> So what is the conclusion here?
>
> group key alias == the first variable's group by field?
>
>
> What happens in a case like this then:
>
> --
> A = load 'somefile1' USING PigStorage() AS (B, C);
> B = load 'somefile2' USING PigStorage() AS (A, C);
> C = load 'somefile3' USING PigStorage() AS (A, B);
>
> G1 = COGROUP A by B, B by A;
> G2 = COGROUP A by C, C by A;
> ...
> --
>
> A slightly contrived example for sure, but imo grammar has to be as
> clearly specified as possible.
>
> A reserved keyword as group alias implies we don't hit this problem
> (group or groupkey or grpkey)... and also the fact that we are
> backward compatible.
>
> [I never liked inferred schema prefix section in the schemas doc (which
> is applied selectively) - makes it extremely tough to generate pig scripts]
>
>
> Regards,
> Mridul
>
> Alan Gates wrote:
> > Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field
> > (or set of fields) that are grouped on are given the alias 'group'.
> > This has a couple of issues:
> >
> > 1) It's confusing. 'group' is now a keyword and an alias.
> > 2) We don't currently allow 'group' as an alias in an AS. It is
> > strange to have an alias that can only be assigned by the language and
> > never by the user.
> >
> > Possible solutions:
> >
> > I) Status quo. We could fix it so that group is allowed to be assigned
> > as an alias in AS.
> >
> > Pros: Backward compatibility
> > Cons: a) will make the parser more complicated
> >       b) see 1) above.
> >
> >
> > II) Don't give an implicit alias to the group key(s). If users want an
> > alias, they can assign it using AS.
> >
> > Pros: Simplicity
> > Cons: We do assign aliases to grouped bags. That is, if we have C =
> > GROUP B by $0, the resulting schema of C is (group, B). So if we don't
> > assign an alias to the group key, we now have a schema ($0, B). This
> > seems strange. And worse yet, if users want to alias the group key(s),
> > they'll be forced to alias all the grouped bags as well.
> >
> > III) Carry the alias (if any) that the field had before. So if we had a
> > script like:
> >
> > A = load 'myfile' as (x, y, z);
> > B = group A by x;
> >
> > Then the schema of B would be (x, A). This is quite natural for grouping
> > of single columns. But it turns nasty when you group on multiple
> > columns. Do we then append the names together? So if you have
> >
> > B = group A by x, y;
> >
> > is the resulting schema (x_y, A)? Ugh.
> >
> > In this case there is also the question of what to do in the case of
> > cogroups, where the key may be named differently in different relations.
> >
> > A = load 'myfile' as (x, y, z);
> > B = load 'myotherfile' as (t, u, v);
> > C = cogroup A by x, B by t;
> >
> > Is the resulting schema (x, A, B) or (t, A, B) or are both valid? This
> > could be resolved by either saying first one always wins, or allowing
> > either.
> >
> > Pros: Very natural for the users, their fields maintain names through
> > the query.
> > Cons: Quickly gets burdensome in the case of multi-key groups.
> >
> > IV) Assign a non-keyword alias to the group key, like grp or groupkey or
> > grpkey (or some other suitable choice).
> > Pros: Least disruptive change. Users only have to go through their
> > scripts and find places where they use the group alias and change it to
> > grp (or whatever).
> > Cons: Still leaves us with a situation where we are assigning a name to
> > a field arbitrarily, leaving users confused as to how their fields got
> > named that.
> >
> > V) Remove GROUP as a keyword. It is just short for COGROUP of one
> > relation anyway.
> >
> > Pros: Smaller syntax in a language is always good.
> > Cons: Will break a lot of scripts, and confuse a lot of users who only
> > think in terms of GROUP and JOIN and never use COGROUP explicitly.
> >
> > One could also conceive of combinations of these. For example, we
> > always assign a name like grpkey to the group key(s), and in the single
> > key case we also carry forward the alias that the field already had, if
> > any.
> >
> > Thoughts? Other possibilities?
> >
> > Alan.
