What about naming the rest of the fields in the group? Do we want to continue naming them with the names of the corresponding tables? I think users find that confusing as well.
Olga

> -----Original Message-----
> From: Alan Gates [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 16, 2008 11:32 AM
> To: [email protected]
> Subject: Re: Issues with group as an alias
>
> I would like to propose a slight modification:
>
> I think that we should continue to support 'group' as the alias
> name for some transition period (3 or maybe 6 months). We can
> remove all references to group as an alias from the documentation
> and print a warning when users use it. But I don't think we should
> drop it immediately, as we'll break many scripts.
>
> Other than that I'm fine with the proposal.
>
> Alan.
>
> Chris Olston wrote:
> > No.
> >
> > The standing proposal for Option III is:
> >
> > 1. If you are (CO)Grouping on a *single* field AND in the case of
> > co-group all field names are the same (e.g., cogroup A by url, B by
> > url), then give the group key that name (e.g., "url").
> > 2. Else, do *not* automatically assign any name. The user can refer to
> > it as $0 and/or use "AS" to give it a name manually.
> >
> > (To be clear, even in case #1, the user has the option to override the
> > automatically-assigned name using "AS" if s/he chooses.)
> >
> > -Chris
> >
> >
> > On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
> >
> >> I completely agree. It does start getting confusing. Especially if we
> >> try to deal with multi field keys.
> >>
> >> A = load 'somefile1' USING PigStorage() AS (B, C, Z);
> >> B = load 'somefile2' USING PigStorage() AS (A, C, Y);
> >> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>
> >> G1 = COGROUP A by (B,C), B by (A, C);
> >> G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
> >>
> >> What is the schema for G2?
> >>
> >> ben
> >>
> >> On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
> >>> So what is the conclusion here ?
> >>>
> >>> group key alias == the first variable's group by field ?
> >>>
> >>>
> >>> What happens in a case like this then :
> >>>
> >>> --
> >>> A = load 'somefile1' USING PigStorage() AS (B, C);
> >>> B = load 'somefile2' USING PigStorage() AS (A, C);
> >>> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>>
> >>> G1 = COGROUP A by B, B by A;
> >>> G2 = COGROUP A by C, C by A;
> >>> ...
> >>> --
> >>>
> >>> A slightly contrived example for sure, but imo grammar has to be as
> >>> clearly specified as possible.
> >>>
> >>> A reserved keyword as group alias implies we don't hit this problem
> >>> (group or groupkey or grpkey)... and also the fact that we are
> >>> backwardly compatible.
> >>>
> >>> [I never liked inferred schema prefix section in the schemas doc
> >>> (which is applied selectively) - makes it extremely tough to
> >>> generate pig scripts]
> >>>
> >>>
> >>> Regards,
> >>> Mridul
> >>>
> >>> Alan Gates wrote:
> >>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the
> >>>> field (or set of fields) that are grouped on are given the alias
> >>>> 'group'.
> >>>> This has a couple of issues:
> >>>>
> >>>> 1) It's confusing. 'group' is now a keyword and an alias.
> >>>> 2) We don't currently allow 'group' as an alias in an AS. It is
> >>>> strange to have an alias that can only be assigned by the language
> >>>> and never by the user.
> >>>>
> >>>> Possible solutions:
> >>>>
> >>>> I) Status quo. We could fix it so that group is allowed to be
> >>>> assigned as an alias in AS.
> >>>>
> >>>> Pros: Backward compatibility
> >>>> Cons: a) will make the parser more complicated
> >>>> b) see 1) above.
> >>>>
> >>>>
> >>>> II) Don't give an implicit alias to the group key(s). If users
> >>>> want an alias, they can assign it using AS.
> >>>>
> >>>> Pros: Simplicity
> >>>> Cons: We do assign aliases to grouped bags. That is, if we have C
> >>>> = GROUP B by $0 the resulting schema of C is (group, B). So if we
> >>>> don't assign an alias to the group key, we now have a schema ($0,
> >>>> B). This seems strange. And worse yet, if users want to alias the
> >>>> group key(s), they'll be forced to alias all the grouped bags as
> >>>> well.
> >>>>
> >>>> III) Carry the alias (if any) that the field had before. So if we
> >>>> had a script like:
> >>>>
> >>>> A = load 'myfile' as (x, y, z);
> >>>> B = group A by x;
> >>>>
> >>>> Then the schema of B would be (x, A). This is quite natural for
> >>>> grouping of single columns. But it turns nasty when you group on
> >>>> multiple columns. Do we then append the names together? So if
> >>>> you have
> >>>>
> >>>> B = group A by x, y;
> >>>>
> >>>> is the resulting schema (x_y, A)? Ugh.
> >>>>
> >>>> In this case there is also the question of what to do in the case
> >>>> of cogroups, where the key may be named differently in different
> >>>> relations.
> >>>>
> >>>> A = load 'myfile' as (x, y, z);
> >>>> B = load 'myotherfile' as (t, u, v);
> >>>> C = cogroup A by x, B by t;
> >>>>
> >>>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?
> >>>> This could be resolved by either saying first one always wins, or
> >>>> allowing either.
> >>>>
> >>>> Pros: Very natural for the users, their fields maintain names
> >>>> through the query.
> >>>> Cons: Quickly gets burdensome in the case of multi-key groups.
> >>>>
> >>>> IV) Assign a non-keyword alias to the group key, like grp or
> >>>> groupkey or grpkey (or some other suitable choice).
> >>>> Pros: Least disruptive change. Users only have to go through
> >>>> their scripts and find places where they use the group alias and
> >>>> change it to grp (or whatever).
> >>>> Cons: Still leaves us with a situation where we are assigning a
> >>>> name to a field arbitrarily, leaving users confused as to how their
> >>>> fields got named that.
> >>>>
> >>>> V) Remove GROUP as a keyword. It is just short for COGROUP of one
> >>>> relation anyway.
> >>>>
> >>>> Pros: Smaller syntax in a language is always good.
> >>>> Cons: Will break a lot of scripts, and confuse a lot of users who
> >>>> only think in terms of GROUP and JOIN and never use COGROUP
> >>>> explicitly.
> >>>>
> >>>> One could also conceive of combinations of these. For example, we
> >>>> always assign a name like grpkey to the group key(s), and in the
> >>>> single key case we also carry forward the alias that the field
> >>>> already had, if any.
> >>>>
> >>>> Thoughts? Other possibilities?
> >>>>
> >>>> Alan.
> >>
> >>
> >
> > --
> > Christopher Olston, Ph.D.
> > Sr. Research Scientist
> > Yahoo! Research
> >
> >
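
For reference, the two-part Option III rule from Chris's message can be sketched as a small function. This is only an illustration in Python, not how Pig implements anything: the function name `infer_group_key_alias`, its arguments, and the use of None for an unnamed field are all assumptions made for the example.

```python
from typing import Optional, Sequence


def infer_group_key_alias(
    keys_per_relation: Sequence[Sequence[Optional[str]]],
    explicit_alias: Optional[str] = None,
) -> Optional[str]:
    """Return the group-key alias under Option III, or None if unnamed.

    keys_per_relation holds, for each relation in the (CO)GROUP, the names
    of the fields it is grouped by (None stands for an unnamed field).
    Hypothetical helper for illustration only; not part of Pig.
    """
    # An explicit AS always overrides the automatically assigned name.
    if explicit_alias is not None:
        return explicit_alias

    # Rule 1 requires every relation to be grouped on a single field.
    if not keys_per_relation or any(len(keys) != 1 for keys in keys_per_relation):
        return None  # Rule 2: no automatic name; refer to the key as $0.

    # Rule 1 also requires all relations to use the same field name.
    names = {keys[0] for keys in keys_per_relation}
    if len(names) == 1:
        (name,) = names
        return name  # None here means the field itself had no name
    return None  # Rule 2 again: names differ across relations.


if __name__ == "__main__":
    print(infer_group_key_alias([["url"], ["url"]]))    # cogroup A by url, B by url
    print(infer_group_key_alias([["x"], ["t"]]))        # cogroup A by x, B by t
    print(infer_group_key_alias([["x", "y"]]))          # group on two fields
    print(infer_group_key_alias([["x"], ["t"]], "k"))   # same cogroup with AS k
```

Under this reading, only the first call gets an automatic name; the second and third fall through to rule 2, and the last takes whatever the user supplied with AS.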
