RE: Issues with group as an alias

Olga Natkovich Mon, 16 Jun 2008 13:30:46 -0700

Chris,

What I meant to ask was what do we do with the rest of the fields in the
group tuples. Currently, we name those fields with the names of the
corresponding tables. I was asking if we want to continue that. I know
that people find it confusing to see fields named after relations.


Olga

> -----Original Message-----
> From: Chris Olston [mailto:[EMAIL PROTECTED] 
> Sent: Monday, June 16, 2008 12:54 PM
> To: [email protected]
> Subject: Re: Issues with group as an alias
> 
> Olga,
> 
> The idea is that when there is just one field with one name, 
> we use that name for the group key. In all other cases we do 
> *not* supply an automatic name (the user can assign their own 
> name using "as").
> 
> I believe this solution: (1) is very simple and unambiguous, 
> and (2) makes common cases very natural (e.g., BAR = group FOO 
> by URL; foreach BAR generate URL, ...).
> 
> -Chris
> 
> On Jun 16, 2008, at 12:48 PM, Olga Natkovich wrote:
> 
> > What about naming the rest of the fields in the group? Do we want to
> > continue naming them with the names of the corresponding tables? I
> > think users find that confusing as well.
> >
> > Olga
> >
> >> -----Original Message-----
> >> From: Alan Gates [mailto:[EMAIL PROTECTED]
> >> Sent: Monday, June 16, 2008 11:32 AM
> >> To: [email protected]
> >> Subject: Re: Issues with group as an alias
> >>
> >> I would like to propose a slight modification:
> >>
> >> I think that we should continue to support 'group' as the alias name
> >> for some transition period (3 or maybe 6 months).
> >> We can remove all references to group as an alias from the 
> >> documentation and print a warning when users use it.  But I don't 
> >> think we should drop it immediately, as we'll break many scripts.
> >>
> >> Other than that I'm fine with the proposal.
> >>
> >> Alan.
> >>
> >> Chris Olston wrote:
> >>> No.
> >>>
> >>> The standing proposal for Option III is:
> >>>
> >>> 1. If you are (CO)Grouping on a *single* field AND in the case of
> >>> co-group all field names are the same (e.g., cogroup A by url, B by
> >>> url), then give the group key that name (e.g., "url").
> >>> 2. Else, do *not* automatically assign any name. The user can refer to
> >>> it as $0 and/or use "AS" to give it a name manually.
> >>>
> >>> (To be clear, even in case #1, the user has the option to override the
> >>> automatically-assigned name using "AS" if s/he chooses.)
> >>>
> >>> -Chris
> >>>
> >>>
> >>> On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
> >>>
> >>>> I completely agree. It does start getting confusing, especially if we
> >>>> try to deal with multi-field keys.
> >>>>
> >>>> A = load 'somefile1' USING PigStorage() AS (B, C, Z);
> >>>> B = load 'somefile2' USING PigStorage() AS (A, C, Y);
> >>>> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>>>
> >>>> G1 = COGROUP A by (B,C), B by (A, C);
> >>>> G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
> >>>>
> >>>> What is the schema for G2?
> >>>>
> >>>> ben
> >>>>
> >>>> On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
> >>>>> So what is the conclusion here ?
> >>>>>
> >>>>> group key alias == the first variable's group-by field?
> >>>>>
> >>>>>
> >>>>> What happens in a case like this then :
> >>>>>
> >>>>> --
> >>>>> A = load 'somefile1' USING PigStorage() AS (B, C);
> >>>>> B = load 'somefile2' USING PigStorage() AS (A, C);
> >>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>>>>
> >>>>> G1 = COGROUP A by B, B by A;
> >>>>> G2 = COGROUP A by C, C by A;
> >>>>> ...
> >>>>> --
> >>>>>
> >>>>> A slightly contrived example for sure, but imo the grammar has to be as
> >>>>> clearly specified as possible.
> >>>>>
> >>>>> A reserved keyword as group alias implies we don't hit this problem
> >>>>> (group or groupkey or grpkey)... and also that we stay
> >>>>> backward compatible.
> >>>>>
> >>>>> [I never liked the inferred schema prefix section in the schemas doc
> >>>>> (which is applied selectively) - it makes it extremely tough to
> >>>>> generate pig scripts]
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>> Mridul
> >>>>>
> >>>>> Alan Gates wrote:
> >>>>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the
> >>>>>> field (or set of fields) that are grouped on are given the alias
> >>>>>> 'group'.
> >>>>>> This has a couple of issues:
> >>>>>>
> >>>>>> 1)  It's confusing.  'group' is now a keyword and an alias.
> >>>>>> 2)  We don't currently allow 'group' as an alias in an AS.  It is
> >>>>>> strange to have an alias that can only be assigned by the language
> >>>>>> and never by the user.
> >>>>>>
> >>>>>> Possible solutions:
> >>>>>>
> >>>>>> I) Status quo.  We could fix it so that group is allowed to be 
> >>>>>> assigned as an alias in AS.
> >>>>>>
> >>>>>> Pros:  Backward compatibility
> >>>>>> Cons: a) will make the parser more complicated
> >>>>>>      b) see 1) above.
> >>>>>>
> >>>>>>
> >>>>>> II) Don't give an implicit alias to the group key(s).  If users
> >>>>>> want an alias, they can assign it using AS.
> >>>>>>
> >>>>>> Pros:  Simplicity
> >>>>>> Cons:  We do assign aliases to grouped bags.  That is, if we have C
> >>>>>> = GROUP B by $0 the resulting schema of C is (group, B).  So if we
> >>>>>> don't assign an alias to the group key, we now have a schema ($0,
> >>>>>> B).  This seems strange.  And worse yet, if users want to alias the
> >>>>>> group key(s), they'll be forced to alias all the grouped bags as
> >>>>>> well.
> >>>>>>
> >>>>>> III) Carry the alias (if any) that the field had before.  So if we
> >>>>>> had a script like:
> >>>>>>
> >>>>>> A = load 'myfile' as (x, y, z);
> >>>>>> B = group A by x;
> >>>>>>
> >>>>>> Then the schema of B would be (x, A).  This is quite natural for
> >>>>>> grouping of single columns.  But it turns nasty when you group on
> >>>>>> multiple columns.  Do we then append the names together?  So if
> >>>>>> you have
> >>>>>>
> >>>>>> B = group A by (x, y);
> >>>>>>
> >>>>>> is the resulting schema (x_y, A)?  Ugh.
> >>>>>>
> >>>>>> In this case there is also the question of what to do in the case
> >>>>>> of cogroups, where the key may be named differently in different
> >>>>>> relations.
> >>>>>>
> >>>>>> A = load 'myfile' as (x, y, z);
> >>>>>> B = load 'myotherfile' as (t, u, v);
> >>>>>> C = cogroup A by x, B by t;
> >>>>>>
> >>>>>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?
> >>>>>> This could be resolved by either saying the first one always wins, or
> >>>>>> allowing either.
> >>>>>>
> >>>>>> Pros:  Very natural for the users, their fields maintain names 
> >>>>>> through the query.
> >>>>>> Cons:  Quickly gets burdensome in the case of multi-key groups.
> >>>>>>
> >>>>>> IV) Assign a non-keyword alias to the group key, like grp or 
> >>>>>> groupkey or grpkey (or some other suitable choice).
> >>>>>> Pros:  Least disruptive change.  Users only have to go through
> >>>>>> their scripts and find places where they use the group alias and
> >>>>>> change it to grp (or whatever).
> >>>>>> Cons:  Still leaves us with a situation where we are assigning a
> >>>>>> name to a field arbitrarily, leaving users confused as to how their
> >>>>>> fields got named that.
> >>>>>>
> >>>>>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
> >>>>>> relation anyway.
> >>>>>>
> >>>>>> Pros:  Smaller syntax in a language is always good.
> >>>>>> Cons:  Will break a lot of scripts, and confuse a lot of users who
> >>>>>> only think in terms of GROUP and JOIN and never use COGROUP
> >>>>>> explicitly.
> >>>>>>
> >>>>>> One could also conceive of combinations of these.  For example, we
> >>>>>> always assign a name like grpkey to the group key(s), and in the
> >>>>>> single key case we also carry forward the alias that the field
> >>>>>> already had, if any.
> >>>>>>
> >>>>>> Thoughts?  Other possibilities?
> >>>>>>
> >>>>>> Alan.
> >>>>
> >>>>
> >>>
> >>> --
> >>> Christopher Olston, Ph.D.
> >>> Sr. Research Scientist
> >>> Yahoo! Research
> >>>
> >>>
> >>>
> >>
> 
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
> 
> 
>
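
For readers following the thread, the naming rule Chris describes for
Option III can be modeled in a few lines. This is only a sketch of the
proposal as stated in the messages above, not Pig's implementation: the
function name group_key_alias, its parameters, and the use of None for an
unnamed or positional field (such as $0) are assumptions made for
illustration.

```python
from typing import Optional, Sequence


def group_key_alias(
    keys_per_relation: Sequence[Sequence[Optional[str]]],
    user_alias: Optional[str] = None,
) -> Optional[str]:
    """Model of the proposed Option III rule for naming the group key.

    keys_per_relation lists, for each relation in the (CO)GROUP, the names
    of the fields it is grouped on. None stands for an unnamed field.
    Returns the alias of the group key, or None when no name is assigned
    (the key is then only reachable by position).
    """
    # An explicit "AS" name always overrides the automatic one.
    if user_alias is not None:
        return user_alias
    # Rule 2: multi-field keys never receive an automatic name.
    if any(len(keys) != 1 for keys in keys_per_relation):
        return None
    # Rule 1: a single field whose name is identical in every relation.
    names = {keys[0] for keys in keys_per_relation}
    if len(names) == 1:
        (name,) = names
        return name  # still None if the single field was unnamed
    # Single-field keys with differing names get no automatic alias.
    return None


if __name__ == "__main__":
    print(group_key_alias([["url"]]))            # group FOO by url
    print(group_key_alias([["url"], ["url"]]))   # cogroup A by url, B by url
    print(group_key_alias([["x"], ["t"]]))       # cogroup A by x, B by t
    print(group_key_alias([["x", "y"]]))         # group A by (x, y)
```

The sketch covers only the group key. How the remaining fields of the
group tuples should be named, which is the question Olga raises, is left
open here just as it is in the thread.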
