What about naming the rest of the fields in the group? Do we want to
continue naming them with the names of the corresponding tables? I think
users find that confusing as well.

Olga 

> -----Original Message-----
> From: Alan Gates [mailto:[EMAIL PROTECTED] 
> Sent: Monday, June 16, 2008 11:32 AM
> To: [email protected]
> Subject: Re: Issues with group as an alias
> 
> I would like to propose a slight modification:
> 
> I think that we should continue to support 'group' as the 
> alias name for some transition period (3 or maybe 6 months).  
> We can remove all references to group as an alias from the 
> documentation and print a warning when users use it.  But I 
> don't think we should drop it immediately, as we'll break 
> many scripts.
> 
> Other than that I'm fine with the proposal.
> 
> Alan.
> 
> Chris Olston wrote:
> > No.
> >
> > The standing proposal for Option III is:
> >
> > 1. If you are (CO)Grouping on a *single* field AND in the case of 
> > co-group all field names are the same (e.g., cogroup A by url, B by 
> > url), then give the group key that name (e.g., "url").
> > 2. Else, do *not* automatically assign any name. The user can refer to
> > it as $0 and/or use "AS" to give it a name manually.
> >
> > (To be clear, even in case #1, the user has the option to override the
> > automatically-assigned name using "AS" if s/he chooses.)
> >
> > -Chris
> >
> >
> > On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
> >
> >> I completely agree. It does start getting confusing. Especially if we
> >> try to deal with multi field keys.
> >>
> >> A = load 'somefile1' USING PigStorage() AS (B, C, Z);
> >> B = load 'somefile2' USING PigStorage() AS (A, C, Y);
> >> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>
> >> G1 = COGROUP A by (B,C), B by (A, C);
> >> G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
> >>
> >> What is the schema for G2?
> >>
> >> ben
> >>
> >> On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
> >>> So what is the conclusion here ?
> >>>
> >>> group key alias == the first variable's group-by field?
> >>>
> >>>
> >>> What happens in a case like this then :
> >>>
> >>> --
> >>> A = load 'somefile1' USING PigStorage() AS (B, C);
> >>> B = load 'somefile2' USING PigStorage() AS (A, C);
> >>> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>>
> >>> G1 = COGROUP A by B, B by A;
> >>> G2 = COGROUP A by C, C by A;
> >>> ...
> >>> --
> >>>
> >>> A slightly contrived example for sure, but imo the grammar has to be as
> >>> clearly specified as possible.
> >>>
> >>> A reserved keyword as group alias implies we don't hit this problem
> >>> (group or groupkey or grpkey)... and also the fact that we are
> >>> backward compatible.
> >>>
> >>> [I never liked inferred schema prefix section in the schemas doc 
> >>> (which is applied selectively) - makes it extremely tough to 
> >>> generate pig scripts]
> >>>
> >>>
> >>> Regards,
> >>> Mridul
> >>>
> >>> Alan Gates wrote:
> >>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the
> >>>> field (or set of fields) that are grouped on are given the alias
> >>>> 'group'.
> >>>> This has a couple of issues:
> >>>>
> >>>> 1)  It's confusing.  'group' is now a keyword and an alias.
> >>>> 2)  We don't currently allow 'group' as an alias in an AS.  It is
> >>>> strange to have an alias that can only be assigned by the language
> >>>> and never by the user.
> >>>>
> >>>> Possible solutions:
> >>>>
> >>>> I) Status quo.  We could fix it so that group is allowed to be 
> >>>> assigned as an alias in AS.
> >>>>
> >>>> Pros:  Backward compatibility
> >>>> Cons: a) will make the parser more complicated
> >>>>      b) see 1) above.
> >>>>
> >>>>
> >>>> II) Don't give an implicit alias to the group key(s).  If users 
> >>>> want an alias, they can assign it using AS.
> >>>>
> >>>> Pros:  Simplicity
> >>>> Cons:  We do assign aliases to grouped bags.  That is, if we have C
> >>>> = GROUP B by $0 the resulting schema of C is (group, B).  So if we
> >>>> don't assign an alias to the group key, we now have a schema ($0,
> >>>> B).  This seems strange.  And worse yet, if users want to alias the
> >>>> group key(s), they'll be forced to alias all the grouped bags as
> >>>> well.
> >>>>
> >>>> III) Carry the alias (if any) that the field had before.  So if we
> >>>> had a script like:
> >>>>
> >>>> A = load 'myfile' as (x, y, z);
> >>>> B = group A by x;
> >>>>
> >>>> Then the schema of B would be (x, A).  This is quite natural for
> >>>> grouping of single columns.  But it turns nasty when you group on
> >>>> multiple columns.  Do we then append the names together?  So if
> >>>> you have
> >>>>
> >>>> B = group A by (x, y);
> >>>>
> >>>> is the resulting schema (x_y, A)?  Ugh.
> >>>>
> >>>> In this case there is also the question of what to do in the case
> >>>> of cogroups, where the key may be named differently in different
> >>>> relations.
> >>>>
> >>>> A = load 'myfile' as (x, y, z);
> >>>> B = load 'myotherfile' as (t, u, v);
> >>>> C = cogroup A by x, B by t;
> >>>>
> >>>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?
> >>>> This could be resolved by either saying first one always wins, or
> >>>> allowing either.
> >>>>
> >>>> Pros:  Very natural for the users, their fields maintain names 
> >>>> through the query.
> >>>> Cons:  Quickly gets burdensome in the case of multi-key groups.
> >>>>
> >>>> IV) Assign a non-keyword alias to the group key, like grp or 
> >>>> groupkey or grpkey (or some other suitable choice).
> >>>> Pros:  Least disruptive change.  Users only have to go through 
> >>>> their scripts and find places where they use the group alias and 
> >>>> change it to grp (or whatever).
> >>>> Cons:  Still leaves us with a situation where we are assigning a
> >>>> name to a field arbitrarily, leaving users confused as to how their
> >>>> fields got named that.
> >>>>
> >>>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
> >>>> relation anyway.
> >>>>
> >>>> Pros:  Smaller syntax in a language is always good.
> >>>> Cons:  Will break a lot of scripts, and confuse a lot of users who
> >>>> only think in terms of GROUP and JOIN and never use COGROUP
> >>>> explicitly.
> >>>>
> >>>> One could also conceive of combinations of these.  For example, we
> >>>> always assign a name like grpkey to the group key(s), and in the
> >>>> single key case we also carry forward the alias that the field
> >>>> already had, if any.
> >>>>
> >>>> Thoughts?  Other possibilities?
> >>>>
> >>>> Alan.
> >>
> >>
> >
> > --
> > Christopher Olston, Ph.D.
> > Sr. Research Scientist
> > Yahoo! Research
> >
> >
> >
> 
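
For reference, the standing Option III proposal quoted above (Chris Olston's
two rules plus the "AS" override) can be modelled as a small function. This is
only a sketch of the rule as stated in the thread, not Pig's implementation.
The function name, its parameters, and the use of None to mean "no alias,
refer to the key as $0" are assumptions made for illustration, and it assumes
every key field already has a name.

```python
from typing import Optional, Sequence


def proposed_group_key_alias(
    keys_per_relation: Sequence[Sequence[str]],
    user_alias: Optional[str] = None,
) -> Optional[str]:
    """Return the alias the group key would get under the proposal.

    keys_per_relation holds one list of key field names per (co)grouped
    relation. None means no automatic name (the key is only reachable as $0).
    Hypothetical helper: it models the rule from the thread, nothing more.
    """
    # An explicit AS always wins, even when rule 1 would apply.
    if user_alias is not None:
        return user_alias

    # Rule 1 requires exactly one key field in every relation ...
    if not keys_per_relation:
        return None
    if any(len(keys) != 1 for keys in keys_per_relation):
        return None

    # ... and the same field name in all of them.
    names = {keys[0] for keys in keys_per_relation}
    if len(names) == 1:
        return names.pop()

    # Rule 2: otherwise do not assign any name automatically.
    return None
```

Under this reading, "cogroup A by url, B by url" yields the alias url, while
"cogroup A by x, B by t" and any multi-field key leave the key unnamed unless
the user supplies one with AS.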
