I agree with Pi. +1 for (1).

ben

On Tuesday 17 June 2008 03:41:13 pi song wrote:
> If it's confusing because our model is different, people just have to
> learn. If it's confusing because it is misleading, it has to be fixed.
>
> As far as we can explain "why" logically, I think it should be ok.
> I vote (1) for this.
>
> On Tue, Jun 17, 2008 at 8:03 AM, Chris Olston <[EMAIL PROTECTED]> wrote:
> > Oh -- sorry I misunderstood.
> >
> > That's a valid question and now is the right time to revisit it. Does
> > anybody see any natural naming convention *other than* naming them after
> > the input tables (Pig's current practice)? If so, let's discuss. If not,
> > it seems the only two choices are: (1) leave it as-is, or (2) do not
> > assign any name, and force the user to use "AS" (this is what Jaql does I
> > believe).
> >
> > -Chris
> >
> > On Jun 16, 2008, at 1:29 PM, Olga Natkovich wrote:
> >> Chris,
> >>
> >> What I meant to ask was what do we do with the rest of the fields in the
> >> group tuples. Currently, we name those fields with the names of the
> >> corresponding tables. I was asking if we want to continue that. I know
> >> that people find it confusing to see fields named after relations.
> >>
> >> Olga
> >>
> >> -----Original Message-----
> >>> From: Chris Olston [mailto:[EMAIL PROTECTED]
> >>> Sent: Monday, June 16, 2008 12:54 PM
> >>> To: [email protected]
> >>> Subject: Re: Issues with group as an alias
> >>>
> >>> Olga,
> >>>
> >>> The idea is that when there is just one field with one name,
> >>> we use that name for the group key. In all other cases we do
> >>> *not* supply an automatic name (the user can assign their own
> >>> name using "as").
> >>>
> >>> I believe this solution: (1) is very simple and unambiguous,
> >>> and (2) makes common cases very natural (e.g., BAR = group FOO
> >>> by URL; foreach BAR generate URL, ...).
> >>>
> >>> -Chris
> >>>
> >>> On Jun 16, 2008, at 12:48 PM, Olga Natkovich wrote:
> >>>> What about naming the rest of the fields in the group? Do we want to
> >>>> continue naming them with the names of the corresponding tables? I
> >>>> think users find that confusing as well.
> >>>>
> >>>> Olga
> >>>>
> >>>> -----Original Message-----
> >>>>> From: Alan Gates [mailto:[EMAIL PROTECTED]
> >>>>> Sent: Monday, June 16, 2008 11:32 AM
> >>>>> To: [email protected]
> >>>>> Subject: Re: Issues with group as an alias
> >>>>>
> >>>>> I would like to propose a slight modification:
> >>>>>
> >>>>> I think that we should continue to support 'group' as the alias name
> >>>>> for some transition period (3 or maybe 6 months).
> >>>>> We can remove all references to group as an alias from the
> >>>>> documentation and print a warning when users use it. But I don't
> >>>>> think we should drop it immediately, as we'll break many scripts.
> >>>>>
> >>>>> Other than that I'm fine with the proposal.
> >>>>>
> >>>>> Alan.
> >>>>>
> >>>>> Chris Olston wrote:
> >>>>>> No.
> >>>>>>
> >>>>>> The standing proposal for Option III is:
> >>>>>>
> >>>>>> 1. If you are (CO)Grouping on a *single* field AND in the case of
> >>>>>> co-group all field names are the same (e.g., cogroup A by url, B by
> >>>>>> url), then give the group key that name (e.g., "url").
> >>>>>>
> >>>>>> 2. Else, do *not* automatically assign any name. The user can refer to
> >>>>>> it as $0 and/or use "AS" to give it a name manually.
> >>>>>>
> >>>>>> (To be clear, even in case #1, the user has the option to override the
> >>>>>> automatically-assigned name using "AS" if s/he chooses.)
> >>>>>>
> >>>>>> -Chris
> >>>>>>
> >>>>>> On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
> >>>>>>> I completely agree. It does start getting confusing.
> >>>>>>> Especially if we try to deal with multi field keys.
> >>>>>>>
> >>>>>>> A = load 'somefile1' USING PigStorage() AS (B, C, Z);
> >>>>>>> B = load 'somefile2' USING PigStorage() AS (A, C, Y);
> >>>>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>>>>>>
> >>>>>>> G1 = COGROUP A by (B,C), B by (A, C);
> >>>>>>> G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
> >>>>>>>
> >>>>>>> What is the schema for G2?
> >>>>>>>
> >>>>>>> ben
> >>>>>>>
> >>>>>>> On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
> >>>>>>>> So what is the conclusion here?
> >>>>>>>>
> >>>>>>>> group key alias == the first variable's group-by field?
> >>>>>>>>
> >>>>>>>> What happens in a case like this then:
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> A = load 'somefile1' USING PigStorage() AS (B, C);
> >>>>>>>> B = load 'somefile2' USING PigStorage() AS (A, C);
> >>>>>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>>>>>>> G1 = COGROUP A by B, B by A;
> >>>>>>>> G2 = COGROUP A by C, C by A;
> >>>>>>>> ...
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> A slightly contrived example for sure, but imo the grammar has to be
> >>>>>>>> as clearly specified as possible.
> >>>>>>>> A reserved keyword as group alias implies we don't hit this problem
> >>>>>>>> (group or groupkey or grpkey)... and also the fact that we are
> >>>>>>>> backward compatible.
> >>>>>>>>
> >>>>>>>> [I never liked inferred schema prefix section in the schemas doc
> >>>>>>>> (which is applied selectively) - makes it extremely tough to
> >>>>>>>> generate pig scripts]
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Mridul
> >>>>>>>>
> >>>>>>>> Alan Gates wrote:
> >>>>>>>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the
> >>>>>>>>> field (or set of fields) that are grouped on are given the alias
> >>>>>>>>> 'group'.
> >>>>>>>>> This has a couple of issues:
> >>>>>>>>>
> >>>>>>>>> 1) It's confusing. 'group' is now a keyword and an alias.
> >>>>>>>>> 2) We don't currently allow 'group' as an alias in an AS. It is
> >>>>>>>>> strange to have an alias that can only be assigned by the language
> >>>>>>>>> and never by the user.
> >>>>>>>>>
> >>>>>>>>> Possible solutions:
> >>>>>>>>>
> >>>>>>>>> I) Status quo. We could fix it so that group is allowed to be
> >>>>>>>>> assigned as an alias in AS.
> >>>>>>>>>
> >>>>>>>>> Pros: Backward compatibility
> >>>>>>>>> Cons: a) will make the parser more complicated
> >>>>>>>>> b) see 1) above.
> >>>>>>>>>
> >>>>>>>>> II) Don't give an implicit alias to the group key(s). If users
> >>>>>>>>> want an alias, they can assign it using AS.
> >>>>>>>>> Pros: Simplicity
> >>>>>>>>> Cons: We do assign aliases to grouped bags. That is, if we have C
> >>>>>>>>> = GROUP B by $0 the resulting schema of C is (group, B). So if we
> >>>>>>>>> don't assign an alias to the group key, we now have a schema ($0,
> >>>>>>>>> B). This seems strange. And worse yet, if users want to alias the
> >>>>>>>>> group key(s), they'll be forced to alias all the grouped bags as
> >>>>>>>>> well.
> >>>>>>>>>
> >>>>>>>>> III) Carry the alias (if any) that the field had before. So if we
> >>>>>>>>> had a script like:
> >>>>>>>>> A = load 'myfile' as (x, y, z);
> >>>>>>>>> B = group A by x;
> >>>>>>>>>
> >>>>>>>>> The schema of B would be (x, A). This is quite natural for
> >>>>>>>>> grouping of single columns. But it turns nasty when you group on
> >>>>>>>>> multiple columns. Do we then append the names together? So if
> >>>>>>>>> you have
> >>>>>>>>> B = group A by x, y;
> >>>>>>>>>
> >>>>>>>>> is the resulting schema (x_y, A)? Ugh.
> >>>>>>>>>
> >>>>>>>>> In this case there is also the question of what to do in the case
> >>>>>>>>> of cogroups, where the key may be named differently in different
> >>>>>>>>> relations.
> >>>>>>>>> A = load 'myfile' as (x, y, z);
> >>>>>>>>> B = load 'myotherfile' as (t, u, v);
> >>>>>>>>> C = cogroup A by x, B by t;
> >>>>>>>>>
> >>>>>>>>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?
> >>>>>>>>> This could be resolved by either saying first one always wins, or
> >>>>>>>>> allowing either.
> >>>>>>>>>
> >>>>>>>>> Pros: Very natural for the users, their fields maintain names
> >>>>>>>>> through the query.
> >>>>>>>>> Cons: Quickly gets burdensome in the case of multi-key groups.
> >>>>>>>>>
> >>>>>>>>> IV) Assign a non-keyword alias to the group key, like grp or
> >>>>>>>>> groupkey or grpkey (or some other suitable choice).
> >>>>>>>>> Pros: Least disruptive change. Users only have to go through
> >>>>>>>>> their scripts and find places where they use the group alias and
> >>>>>>>>> change it to grp (or whatever).
> >>>>>>>>> Cons: Still leaves us with a situation where we are assigning a
> >>>>>>>>> name to a field arbitrarily, leaving users confused as to how their
> >>>>>>>>> fields got named that.
> >>>>>>>>>
> >>>>>>>>> V) Remove GROUP as a keyword. It is just short for COGROUP of one
> >>>>>>>>> relation anyway.
> >>>>>>>>> Pros: Smaller syntax in a language is always good.
> >>>>>>>>> Cons: Will break a lot of scripts, and confuse a lot of users who
> >>>>>>>>> only think in terms of GROUP and JOIN and never use COGROUP
> >>>>>>>>> explicitly.
> >>>>>>>>>
> >>>>>>>>> One could also conceive of combinations of these. For example, we
> >>>>>>>>> always assign a name like grpkey to the group key(s), and in the
> >>>>>>>>> single key case we also carry forward the alias that the field
> >>>>>>>>> already had, if any.
> >>>>>>>>>
> >>>>>>>>> Thoughts? Other possibilities?
> >>>>>>>>>
> >>>>>>>>> Alan.
> >>>>>>
> >>>>>> --
> >>>>>> Christopher Olston, Ph.D.
> >>>>>> Sr. Research Scientist
> >>>>>> Yahoo! Research
> >>>
> >>> --
> >>> Christopher Olston, Ph.D.
> >>> Sr. Research Scientist
> >>> Yahoo! Research
> >
> > --
> > Christopher Olston, Ph.D.
> > Sr. Research Scientist
> > Yahoo! Research
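
For reference, the Option III naming rule quoted above (single grouping field, same name on every input, otherwise no automatic name) can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the thread, not Pig code: the helper names group_key_alias and grouped_schema are made up here, a field with no alias is modeled as None, and bags are named after their input relations, which is the "leave it as-is" choice (1).

```python
from typing import Optional, Sequence, Tuple


def group_key_alias(keys_per_input: Sequence[Sequence[Optional[str]]]) -> Optional[str]:
    """Return the automatic group-key alias under Option III, or None.

    keys_per_input holds, for each relation in the (CO)GROUP, the names of
    the fields it is grouped by. A field without an alias is given as None.
    Hypothetical helper for illustration only; not part of Pig.
    """
    if not keys_per_input:
        return None
    # Rule 1 applies only when every input is grouped on a *single* field.
    if any(len(keys) != 1 for keys in keys_per_input):
        return None
    names = {keys[0] for keys in keys_per_input}
    # In a cogroup, all inputs must use the same field name.
    if len(names) != 1:
        return None
    # A lone unnamed field (None) has no alias to carry forward.
    return names.pop()


def grouped_schema(
    relations: Sequence[str],
    keys_per_input: Sequence[Sequence[Optional[str]]],
    as_alias: Optional[str] = None,
) -> Tuple[str, ...]:
    """Schema of the (CO)GROUP result: the key first, then one bag per input.

    An explicit AS alias always wins; with no alias at all the key is only
    reachable positionally, shown here as "$0".
    """
    key = as_alias or group_key_alias(keys_per_input) or "$0"
    return (key, *relations)


if __name__ == "__main__":
    # B = group A by x;
    print(grouped_schema(["A"], [["x"]]))
    # C = cogroup A by x, B by t;
    print(grouped_schema(["A", "B"], [["x"], ["t"]]))
```

Under this sketch, "group A by x" keeps the name x, while a multi-field key or a cogroup on differently named fields (x versus t) gets no automatic name unless the user supplies one with AS.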
