+1. Makes sense to me
> -----Original Message-----
> From: Chris Olston [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 16, 2008 10:29 AM
> To: [email protected]
> Subject: Re: Issues with group as an alias
>
> No.
>
> The standing proposal for Option III is:
>
> 1. If you are (CO)Grouping on a *single* field AND, in the case of
>    cogroup, all field names are the same (e.g., cogroup A by url,
>    B by url), then give the group key that name (e.g., "url").
> 2. Else, do *not* automatically assign any name. The user can
>    refer to it as $0 and/or use "AS" to give it a name manually.
>
> (To be clear, even in case #1, the user has the option to
> override the automatically-assigned name using "AS" if s/he chooses.)
>
> -Chris
>
>
> On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
>
> > I completely agree. It does start getting confusing, especially if we
> > try to deal with multi-field keys.
> >
> > A = load 'somefile1' USING PigStorage() AS (B, C, Z);
> > B = load 'somefile2' USING PigStorage() AS (A, C, Y);
> > C = load 'somefile3' USING PigStorage() AS (A, B);
> >
> > G1 = COGROUP A by (B,C), B by (A, C);
> > G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
> >
> > What is the schema for G2?
> >
> > ben
> >
> > On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
> >> So what is the conclusion here?
> >>
> >> group key alias == the first variable's group-by field?
> >>
> >>
> >> What happens in a case like this then:
> >>
> >> --
> >> A = load 'somefile1' USING PigStorage() AS (B, C);
> >> B = load 'somefile2' USING PigStorage() AS (A, C);
> >> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>
> >> G1 = COGROUP A by B, B by A;
> >> G2 = COGROUP A by C, C by A;
> >> ...
> >> --
> >>
> >> A slightly contrived example for sure, but imo the grammar has to be
> >> as clearly specified as possible.
> >>
> >> A reserved keyword as group alias implies we don't hit this problem
> >> (group or groupkey or grpkey)... and also the fact that we are
> >> backward compatible.
> >>
> >> [I never liked the inferred schema prefix section in the schemas doc
> >> (which is applied selectively) - it makes it extremely tough to
> >> generate pig scripts]
> >>
> >>
> >> Regards,
> >> Mridul
> >>
> >> Alan Gates wrote:
> >>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the
> >>> field (or set of fields) that are grouped on are given the alias
> >>> 'group'.
> >>> This has a couple of issues:
> >>>
> >>> 1) It's confusing. 'group' is now a keyword and an alias.
> >>> 2) We don't currently allow 'group' as an alias in an AS. It is
> >>> strange to have an alias that can only be assigned by the language
> >>> and never by the user.
> >>>
> >>> Possible solutions:
> >>>
> >>> I) Status quo. We could fix it so that group is allowed to be
> >>> assigned as an alias in AS.
> >>>
> >>> Pros: Backward compatibility
> >>> Cons: a) will make the parser more complicated
> >>>       b) see 1) above.
> >>>
> >>>
> >>> II) Don't give an implicit alias to the group key(s). If users want
> >>> an alias, they can assign it using AS.
> >>>
> >>> Pros: Simplicity
> >>> Cons: We do assign aliases to grouped bags. That is, if we have
> >>> C = GROUP B by $0, the resulting schema of C is (group, B). So if we
> >>> don't assign an alias to the group key, we now have a schema ($0, B).
> >>> This seems strange. And worse yet, if users want to alias the group
> >>> key(s), they'll be forced to alias all the grouped bags as well.
> >>>
> >>> III) Carry the alias (if any) that the field had before. So if we
> >>> had a script like:
> >>>
> >>> A = load 'myfile' as (x, y, z);
> >>> B = group A by x;
> >>>
> >>> Then the schema of B would be (x, A). This is quite natural for
> >>> grouping of single columns. But it turns nasty when you group on
> >>> multiple columns. Do we then append the names together? So if
> >>> you have
> >>>
> >>> B = group A by x, y;
> >>>
> >>> is the resulting schema (x_y, A)? Ugh.
> >>>
> >>> In this case there is also the question of what to do in the case of
> >>> cogroups, where the key may be named differently in different
> >>> relations.
> >>>
> >>> A = load 'myfile' as (x, y, z);
> >>> B = load 'myotherfile' as (t, u, v);
> >>> C = cogroup A by x, B by t;
> >>>
> >>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?
> >>> This could be resolved by either saying first one always wins, or
> >>> allowing either.
> >>>
> >>> Pros: Very natural for the users, their fields maintain names
> >>> through the query.
> >>> Cons: Quickly gets burdensome in the case of multi-key groups.
> >>>
> >>> IV) Assign a non-keyword alias to the group key, like grp or
> >>> groupkey or grpkey (or some other suitable choice).
> >>> Pros: Least disruptive change. Users only have to go through their
> >>> scripts and find places where they use the group alias and change it
> >>> to grp (or whatever).
> >>> Cons: Still leaves us with a situation where we are assigning a
> >>> name to a field arbitrarily, leaving users confused as to how their
> >>> fields got named that.
> >>>
> >>> V) Remove GROUP as a keyword. It is just short for COGROUP of one
> >>> relation anyway.
> >>>
> >>> Pros: Smaller syntax in a language is always good.
> >>> Cons: Will break a lot of scripts, and confuse a lot of users who
> >>> only think in terms of GROUP and JOIN and never use COGROUP
> >>> explicitly.
> >>>
> >>> One could also conceive of combinations of these. For example, we
> >>> always assign a name like grpkey to the group key(s), and in the
> >>> single key case we also carry forward the alias that the field
> >>> already had, if any.
> >>>
> >>> Thoughts? Other possibilities?
> >>>
> >>> Alan.
> >
> >
>
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
>
>
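
For reference, the Option III naming rule quoted above (the two numbered cases plus the "AS" override) can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the thread, not Pig's implementation: the function name group_key_alias, its parameters, and the use of None to mean "no automatic name, refer to the key as $0" are all hypothetical. Treating a positional field such as $0 as having no name is also an assumption.

```python
from typing import Optional, Sequence


def group_key_alias(
    keys_per_relation: Sequence[Sequence[Optional[str]]],
    user_alias: Optional[str] = None,
) -> Optional[str]:
    """Return the alias of the group key, or None if it stays unnamed ($0).

    keys_per_relation holds, for each (co)grouped relation, the names of the
    fields it is grouped on. A field without a name (e.g. $0) is given as
    None. All names here are hypothetical, for illustration only.
    """
    # An explicit "AS" always wins, in both case 1 and case 2.
    if user_alias is not None:
        return user_alias

    # Case 1 requires grouping on a single field in every relation.
    if any(len(keys) != 1 for keys in keys_per_relation):
        return None

    # Case 1 also requires that every relation uses the same field name.
    names = {keys[0] for keys in keys_per_relation}
    if len(names) == 1:
        (name,) = names
        return name  # None if the single field had no name to carry over

    # Case 2: differing names, so no automatic alias is assigned.
    return None


# Examples taken from the thread:
print(group_key_alias([["url"], ["url"]]))  # cogroup A by url, B by url
print(group_key_alias([["x"], ["t"]]))      # cogroup A by x, B by t
print(group_key_alias([["x", "y"]]))        # group on two fields
```

Under this sketch, the cogroup on url in both relations gets the name "url", while the cogroup on x and t and the two-field group both stay unnamed unless the user supplies a name with "AS".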
