RE: Issues with group as an alias

Olga Natkovich Mon, 16 Jun 2008 10:59:40 -0700

+1.  Makes sense to me


> -----Original Message-----
> From: Chris Olston [mailto:[EMAIL PROTECTED] 
> Sent: Monday, June 16, 2008 10:29 AM
> To: [email protected]
> Subject: Re: Issues with group as an alias
> 
> No.
> 
> The standing proposal for Option III is:
> 
> 1. If you are (CO)Grouping on a *single* field AND, in the case of
> cogroup, all field names are the same (e.g., cogroup A by url, B by
> url), then give the group key that name (e.g., "url").
> 2. Else, do *not* automatically assign any name. The user can 
> refer to it as $0 and/or use "AS" to give it a name manually.
> 
> (To be clear, even in case #1, the user has the option to 
> override the automatically-assigned name using "AS" if s/he chooses.)
> 
> -Chris
> 
> 
> On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
> 
> > I completely agree. It does start getting confusing, especially if
> > we try to deal with multi-field keys.
> >
> > A = load 'somefile1' USING PigStorage() AS (B, C, Z);
> > B = load 'somefile2' USING PigStorage() AS (A, C, Y);
> > C = load 'somefile3' USING PigStorage() AS (A, B);
> >
> > G1 = COGROUP A by (B,C), B by (A, C);
> > G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
> >
> > What is the schema for G2?
> >
> > ben
> >
> > On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
> >> So what is the conclusion here?
> >>
> >> group key alias == the first variable's group-by field?
> >>
> >>
> >> What happens in a case like this, then:
> >>
> >> --
> >> A = load 'somefile1' USING PigStorage() AS (B, C);
> >> B = load 'somefile2' USING PigStorage() AS (A, C);
> >> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>
> >> G1 = COGROUP A by B, B by A;
> >> G2 = COGROUP A by C, C by A;
> >> ...
> >> --
> >>
> >> A slightly contrived example for sure, but imo the grammar has to
> >> be as clearly specified as possible.
> >>
> >> A reserved keyword as group alias implies we don't hit this problem
> >> (group or groupkey or grpkey)... and also that we are backward
> >> compatible.
> >>
> >> [I never liked the inferred schema prefix section in the schemas
> >> doc (which is applied selectively) - it makes it extremely tough to
> >> generate pig scripts]
> >>
> >>
> >> Regards,
> >> Mridul
> >>
> >> Alan Gates wrote:
> >>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the
> >>> field (or set of fields) that are grouped on are given the alias
> >>> 'group'.
> >>> This has a couple of issues:
> >>>
> >>> 1)  It's confusing.  'group' is now a keyword and an alias.
> >>> 2)  We don't currently allow 'group' as an alias in an AS.  It is
> >>> strange to have an alias that can only be assigned by the language
> >>> and never by the user.
> >>>
> >>> Possible solutions:
> >>>
> >>> I) Status quo.  We could fix it so that group is allowed to be 
> >>> assigned as an alias in AS.
> >>>
> >>> Pros:  Backward compatibility
> >>> Cons: a) will make the parser more complicated
> >>>      b) see 1) above.
> >>>
> >>>
> >>> II) Don't give an implicit alias to the group key(s).  If users
> >>> want an alias, they can assign it using AS.
> >>>
> >>> Pros:  Simplicity
> >>> Cons:  We do assign aliases to grouped bags.  That is, if we have
> >>> C = GROUP B by $0 the resulting schema of C is (group, B).  So if
> >>> we don't assign an alias to the group key, we now have a schema
> >>> ($0, B).  This seems strange.  And worse yet, if users want to
> >>> alias the group key(s), they'll be forced to alias all the grouped
> >>> bags as well.
> >>>
> >>> III) Carry the alias (if any) that the field had before.  So if we
> >>> had a script like:
> >>>
> >>> A = load 'myfile' as (x, y, z);
> >>> B = group A by x;
> >>>
> >>> Then the schema of B would be (x, A).  This is quite natural for
> >>> grouping of single columns.  But it turns nasty when you group on
> >>> multiple columns.  Do we then append the names together?  So if
> >>> you have
> >>>
> >>> B = group A by (x, y);
> >>>
> >>> is the resulting schema (x_y, A)?  Ugh.
> >>>
> >>> In this case there is also the question of what to do in the case
> >>> of cogroups, where the key may be named differently in different
> >>> relations.
> >>>
> >>> A = load 'myfile' as (x, y, z);
> >>> B = load 'myotherfile' as (t, u, v);
> >>> C = cogroup A by x, B by t;
> >>>
> >>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?
> >>> This could be resolved by either saying first one always wins, or
> >>> allowing either.
> >>>
> >>> Pros:  Very natural for the users, their fields maintain names 
> >>> through the query.
> >>> Cons:  Quickly gets burdensome in the case of multi-key groups.
> >>>
> >>> IV) Assign a non-keyword alias to the group key, like grp or 
> >>> groupkey or grpkey (or some other suitable choice).
> >>> Pros:  Least disruptive change.  Users only have to go through
> >>> their scripts and find places where they use the group alias and
> >>> change it to grp (or whatever).
> >>> Cons:  Still leaves us with a situation where we are assigning a
> >>> name to a field arbitrarily, leaving users confused as to how
> >>> their fields got named that.
> >>>
> >>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
> >>> relation anyway.
> >>>
> >>> Pros:  Smaller syntax in a language is always good.
> >>> Cons:  Will break a lot of scripts, and confuse a lot of users who
> >>> only think in terms of GROUP and JOIN and never use COGROUP
> >>> explicitly.
> >>>
> >>> One could also conceive of combinations of these.  For example, we
> >>> always assign a name like grpkey to the group key(s), and in the
> >>> single key case we also carry forward the alias that the field
> >>> already had, if any.
> >>>
> >>> Thoughts?  Other possibilities?
> >>>
> >>> Alan.
> >
> >
> 
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
> 
> 
>
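
For reference, the naming rule in Chris's Option III proposal quoted above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the thread, not Pig's actual implementation: the function name group_key_alias, its parameters, and the use of None for "no name assigned" (or for a field that has no alias, such as one referenced only as $0) are all assumptions made for the example.

```python
from typing import Optional, Sequence


def group_key_alias(
    keys_per_relation: Sequence[Sequence[Optional[str]]],
    explicit_alias: Optional[str] = None,
) -> Optional[str]:
    """Return the alias for the group key, or None if it stays unnamed.

    keys_per_relation holds, for each relation in the (CO)GROUP, the
    list of field names it is grouped by (hypothetical representation).
    """
    # A user-supplied "AS" name always overrides the automatic one.
    if explicit_alias is not None:
        return explicit_alias

    # Case 1: every relation is grouped on a single field ...
    if all(len(keys) == 1 for keys in keys_per_relation):
        names = {keys[0] for keys in keys_per_relation}
        # ... and that field has the same name in every relation.
        if len(names) == 1:
            return names.pop()  # may be None if the field has no alias

    # Case 2: no automatic name; the key is still reachable as $0.
    return None


print(group_key_alias([["url"], ["url"]]))   # cogroup A by url, B by url
print(group_key_alias([["x"], ["t"]]))       # cogroup A by x, B by t
print(group_key_alias([["x", "y"]]))         # group A by (x, y)
```

Under this sketch, Alan's cogroup of A by x and B by t and Ben's multi-field cogroups both fall into case 2, so the key gets no automatic name unless the user supplies one with AS.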
