Chris,

What I meant to ask was: what do we do with the rest of the fields in the group tuples? Currently, we name those fields after the corresponding tables. I was asking whether we want to continue doing that. I know that people find it confusing to see fields named after relations.

Olga

> -----Original Message-----
> From: Chris Olston [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 16, 2008 12:54 PM
> To: [email protected]
> Subject: Re: Issues with group as an alias
>
> Olga,
>
> The idea is that when there is just one field with one name,
> we use that name for the group key. In all other cases we do
> *not* supply an automatic name (the user can assign their own
> name using "as").
>
> I believe this solution: (1) is very simple and unambiguous,
> and (2) makes common cases very natural (e.g., BAR = group FOO
> by URL; foreach BAR generate URL, ...).
>
> -Chris
>
> On Jun 16, 2008, at 12:48 PM, Olga Natkovich wrote:
>
> > What about naming the rest of the fields in the group? Do we want to
> > continue naming them with the names of the corresponding tables? I
> > think users find that confusing as well.
> >
> > Olga
> >
> >> -----Original Message-----
> >> From: Alan Gates [mailto:[EMAIL PROTECTED]
> >> Sent: Monday, June 16, 2008 11:32 AM
> >> To: [email protected]
> >> Subject: Re: Issues with group as an alias
> >>
> >> I would like to propose a slight modification:
> >>
> >> I think that we should continue to support 'group' as the alias name
> >> for some transition period (3 or maybe 6 months).
> >> We can remove all references to group as an alias from the
> >> documentation and print a warning when users use it. But I don't
> >> think we should drop it immediately, as we'll break many scripts.
> >>
> >> Other than that I'm fine with the proposal.
> >>
> >> Alan.
> >>
> >> Chris Olston wrote:
> >>> No.
> >>>
> >>> The standing proposal for Option III is:
> >>>
> >>> 1. If you are (CO)Grouping on a *single* field AND in the case of
> >>> co-group all field names are the same (e.g., cogroup A by url, B by
> >>> url), then give the group key that name (e.g., "url").
> >>> 2. Else, do *not* automatically assign any name. The user can refer to
> >>> it as $0 and/or use "AS" to give it a name manually.
> >>>
> >>> (To be clear, even in case #1, the user has the option to override the
> >>> automatically-assigned name using "AS" if s/he chooses.)
> >>>
> >>> -Chris
> >>>
> >>>
> >>> On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
> >>>
> >>>> I completely agree. It does start getting confusing. Especially if we
> >>>> try to deal with multi field keys.
> >>>>
> >>>> A = load 'somefile1' USING PigStorage() AS (B, C, Z);
> >>>> B = load 'somefile2' USING PigStorage() AS (A, C, Y);
> >>>> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>>>
> >>>> G1 = COGROUP A by (B,C), B by (A, C);
> >>>> G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
> >>>>
> >>>> What is the schema for G2?
> >>>>
> >>>> ben
> >>>>
> >>>> On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
> >>>>> So what is the conclusion here?
> >>>>>
> >>>>> group key alias == the first variable's group by field?
> >>>>>
> >>>>>
> >>>>> What happens in a case like this then:
> >>>>>
> >>>>> --
> >>>>> A = load 'somefile1' USING PigStorage() AS (B, C);
> >>>>> B = load 'somefile2' USING PigStorage() AS (A, C);
> >>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>>>>
> >>>>> G1 = COGROUP A by B, B by A;
> >>>>> G2 = COGROUP A by C, C by A;
> >>>>> ...
> >>>>> --
> >>>>>
> >>>>> A slightly contrived example for sure, but imo grammar has to be as
> >>>>> clearly specified as possible.
> >>>>>
> >>>>> A reserved keyword as group alias implies we don't hit this problem
> >>>>> (group or groupkey or grpkey)... and also the fact that we are
> >>>>> backwardly compatible.
> >>>>>
> >>>>> [I never liked the inferred schema prefix section in the schemas doc
> >>>>> (which is applied selectively) - makes it extremely tough to
> >>>>> generate pig scripts]
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>> Mridul
> >>>>>
> >>>>> Alan Gates wrote:
> >>>>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the
> >>>>>> field (or set of fields) that are grouped on are given the alias
> >>>>>> 'group'.
> >>>>>> This has a couple of issues:
> >>>>>>
> >>>>>> 1) It's confusing. 'group' is now a keyword and an alias.
> >>>>>> 2) We don't currently allow 'group' as an alias in an AS. It is
> >>>>>> strange to have an alias that can only be assigned by the language
> >>>>>> and never by the user.
> >>>>>>
> >>>>>> Possible solutions:
> >>>>>>
> >>>>>> I) Status quo. We could fix it so that group is allowed to be
> >>>>>> assigned as an alias in AS.
> >>>>>>
> >>>>>> Pros: Backward compatibility
> >>>>>> Cons: a) will make the parser more complicated
> >>>>>>       b) see 1) above.
> >>>>>>
> >>>>>>
> >>>>>> II) Don't give an implicit alias to the group key(s). If users
> >>>>>> want an alias, they can assign it using AS.
> >>>>>>
> >>>>>> Pros: Simplicity
> >>>>>> Cons: We do assign aliases to grouped bags. That is, if we have C
> >>>>>> = GROUP B by $0 the resulting schema of C is (group, B). So if we
> >>>>>> don't assign an alias to the group key, we now have a schema ($0,
> >>>>>> B). This seems strange. And worse yet, if users want to alias the
> >>>>>> group key(s), they'll be forced to alias all the grouped bags as
> >>>>>> well.
> >>>>>>
> >>>>>> III) Carry the alias (if any) that the field had before. So if we
> >>>>>> had a script like:
> >>>>>>
> >>>>>> A = load 'myfile' as (x, y, z);
> >>>>>> B = group A by x;
> >>>>>>
> >>>>>> Then the schema of B would be (x, A). This is quite natural for
> >>>>>> grouping of single columns.
> >>>>>> But it turns nasty when you group on
> >>>>>> multiple columns. Do we then append the names together? So if
> >>>>>> you have
> >>>>>>
> >>>>>> B = group A by (x, y);
> >>>>>>
> >>>>>> is the resulting schema (x_y, A)? Ugh.
> >>>>>>
> >>>>>> In this case there is also the question of what to do in the case
> >>>>>> of cogroups, where the key may be named differently in different
> >>>>>> relations.
> >>>>>>
> >>>>>> A = load 'myfile' as (x, y, z);
> >>>>>> B = load 'myotherfile' as (t, u, v);
> >>>>>> C = cogroup A by x, B by t;
> >>>>>>
> >>>>>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?
> >>>>>> This
> >>>>>> could be resolved by either saying first one always wins, or
> >>>>>> allowing either.
> >>>>>>
> >>>>>> Pros: Very natural for the users, their fields maintain names
> >>>>>> through the query.
> >>>>>> Cons: Quickly gets burdensome in the case of multi-key groups.
> >>>>>>
> >>>>>> IV) Assign a non-keyword alias to the group key, like grp or
> >>>>>> groupkey or grpkey (or some other suitable choice).
> >>>>>> Pros: Least disruptive change. Users only have to go through
> >>>>>> their scripts and find places where they use the group alias and
> >>>>>> change it to grp (or whatever).
> >>>>>> Cons: Still leaves us with a situation where we are assigning a
> >>>>>> name to a field arbitrarily, leaving users confused as to how their
> >>>>>> fields got named that.
> >>>>>>
> >>>>>> V) Remove GROUP as a keyword. It is just short for COGROUP of one
> >>>>>> relation anyway.
> >>>>>>
> >>>>>> Pros: Smaller syntax in a language is always good.
> >>>>>> Cons: Will break a lot of scripts, and confuse a lot of users who
> >>>>>> only think in terms of GROUP and JOIN and never use COGROUP
> >>>>>> explicitly.
> >>>>>>
> >>>>>> One could also conceive of combinations of these.
> >>>>>> For example, we
> >>>>>> always assign a name like grpkey to the group key(s), and in the
> >>>>>> single key case we also carry forward the alias that the field
> >>>>>> already had, if any.
> >>>>>>
> >>>>>> Thoughts? Other possibilities?
> >>>>>>
> >>>>>> Alan.
> >>>>
> >>>>
> >>>
> >>> --
> >>> Christopher Olston, Ph.D.
> >>> Sr. Research Scientist
> >>> Yahoo! Research
> >>>
> >>
> >
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
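
For reference, the Option III naming rule as stated in the thread (single-field key with the same name in every relation gets that name; anything else gets no automatic name) can be sketched in a few lines of Python. This is only an illustration of the rule as worded above, not Pig code: the function name group_key_name and the list-of-lists input shape are hypothetical, and an unnamed field is assumed to be represented as None.

```python
from typing import Optional, Sequence


def group_key_name(
    keys_per_relation: Sequence[Sequence[Optional[str]]],
) -> Optional[str]:
    """Return the automatically assigned group-key name, or None for no name.

    keys_per_relation has one entry per (co)grouped relation; each entry is
    the list of field names that relation is grouped on (None = no alias).
    Hypothetical helper, written only to illustrate the proposal.
    """
    # Rule 2: multi-field keys never get an automatic name.
    if any(len(keys) != 1 for keys in keys_per_relation):
        return None
    # Rule 1: a single field whose name is identical in every relation.
    names = {keys[0] for keys in keys_per_relation}
    if len(names) == 1:
        (name,) = names
        return name  # still None if the field itself has no alias
    # Names differ across relations (e.g. cogroup A by x, B by t).
    return None


if __name__ == "__main__":
    print(group_key_name([["url"], ["url"]]))   # cogroup A by url, B by url
    print(group_key_name([["x"], ["t"]]))       # cogroup A by x, B by t
    print(group_key_name([["x", "y"]]))         # group A by (x, y)
```

Under this sketch, both of the cogroup examples raised earlier in the thread (different key names per relation, or multi-field keys) fall into the "no automatic name" case, so the user would refer to the key as $0 or name it with AS.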
