I completely agree. It does start getting confusing, especially if we try to 
deal with multi-field keys.

A = load 'somefile1' USING PigStorage() AS (B, C, Z);
B = load 'somefile2' USING PigStorage() AS (A, C, Y);
C = load 'somefile3' USING PigStorage() AS (A, B);

G1 = COGROUP A by (B,C), B by (A, C);
G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);

What is the schema for G2?
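
To make the question concrete, here is a rough sketch (in Python, not Pig) of what the "carry the alias forward" option might produce. It assumes that multi-field keys are joined with "_", that the first relation in the COGROUP names the key, and that a qualified name like A.Z is flattened to A_Z. None of these rules is settled, and group_key_alias and cogroup_schema are hypothetical helpers, not part of Pig.

```python
# Hypothetical model of option III: the group key inherits the names of the
# grouped fields from the first relation, joined with "_" for multi-field keys.
# This is not Pig code; the naming rule itself is an assumption.

def group_key_alias(key_fields):
    """Return the alias option III might give a group key.

    Qualified names such as "A.Z" are flattened by replacing "." with "_".
    """
    return "_".join(f.replace(".", "_") for f in key_fields)


def cogroup_schema(*inputs):
    """Build a flat schema from (relation_name, key_fields) pairs.

    The first relation's key fields name the group key; each input
    relation then contributes one bag named after the relation.
    """
    _, first_keys = inputs[0]
    return [group_key_alias(first_keys)] + [name for name, _ in inputs]


g1 = cogroup_schema(("A", ["B", "C"]), ("B", ["A", "C"]))
g2 = cogroup_schema(("G1", [g1[0], "A.Z"]), ("C", ["A", "B"]))
print(g1)  # ['B_C', 'A', 'B']
print(g2)  # ['B_C_A_Z', 'G1', 'C']
```

Under these assumptions G1 comes out as (B_C, A, B), where the bag aliases A and B are also field names inside the bags, and G2's key becomes B_C_A_Z. That is hardly something a user would guess.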

ben

On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
> So what is the conclusion here ?
>
> group key alias == the first variable's group-by field?
>
>
> What happens in a case like this then :
>
> --
> A = load 'somefile1' USING PigStorage() AS (B, C);
> B = load 'somefile2' USING PigStorage() AS (A, C);
> C = load 'somefile3' USING PigStorage() AS (A, B);
>
> G1 = COGROUP A by B, B by A;
> G2 = COGROUP A by C, C by A;
> ...
> --
>
> A slightly contrived example for sure, but imo the grammar has to be as
> clearly specified as possible.
>
> A reserved keyword as group alias implies we don't hit this problem
> (group or groupkey or grpkey)... and also that we stay
> backward compatible.
>
> [I never liked the inferred schema prefix section in the schemas doc (which
> is applied selectively) - it makes it extremely tough to generate Pig scripts]
>
>
> Regards,
> Mridul
>
> Alan Gates wrote:
> > Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field
> > (or set of fields) that is grouped on is given the alias 'group'.
> > This has a couple of issues:
> >
> > 1)  It's confusing.  'group' is now a keyword and an alias.
> > 2)  We don't currently allow 'group' as an alias in an AS.  It is
> > strange to have an alias that can only be assigned by the language and
> > never by the user.
> >
> > Possible solutions:
> >
> > I) Status quo.  We could fix it so that group is allowed to be assigned
> > as an alias in AS.
> >
> > Pros:  Backward compatibility
> > Cons: a) will make the parser more complicated
> >      b) see 1) above.
> >
> >
> > II) Don't give an implicit alias to the group key(s).  If users want an
> > alias, they can assign it using AS.
> >
> > Pros:  Simplicity
> > Cons:  We do assign aliases to grouped bags.  That is, if we have C =
> > GROUP B by $0 the resulting schema of C is (group, B).  So if we don't
> > assign an alias to the group key, we now have a schema ($0, B).  This
> > seems strange.  And worse yet, if users want to alias the group key(s),
> > they'll be forced to alias all the grouped bags as well.
> >
> > III) Carry the alias (if any) that the field had before.  So if we had a
> > script like:
> >
> > A = load 'myfile' as (x, y, z);
> > B = group A by x;
> >
> > Then the schema of B would be (x, A).  This is quite natural for grouping
> > of single columns.  But it turns nasty when you group on multiple
> > columns.  Do we then append the names together?  So if you have
> >
> > B = group A by (x, y);
> >
> > is the resulting schema (x_y, A)?  Ugh.
> >
> > In this case there is also the question of what to do in the case of
> > cogroups, where the key may be named differently in different relations.
> >
> > A = load 'myfile' as (x, y, z);
> > B = load 'myotherfile' as (t, u, v);
> > C = cogroup A by x, B by t;
> >
> > Is the resulting schema (x, A, B) or (t, A, B) or are both valid?  This
> > could be resolved by either saying first one always wins, or allowing
> > either.
> >
> > Pros:  Very natural for the users, their fields maintain names through
> > the query.
> > Cons:  Quickly gets burdensome in the case of multi-key groups.
> >
> > IV) Assign a non-keyword alias to the group key, like grp or groupkey or
> > grpkey (or some other suitable choice).
> > Pros:  Least disruptive change.  Users only have to go through their
> > scripts and find places where they use the group alias and change it to
> > grp (or whatever).
> > Cons:  Still leaves us with a situation where we are assigning a name to
> > a field arbitrarily, leaving users confused as to how their fields got
> > named that.
> >
> > V) Remove GROUP as a keyword.  It is just short for COGROUP of one
> > relation anyway.
> >
> > Pros:  Smaller syntax in a language is always good.
> > Cons:  Will break a lot of scripts, and confuse a lot of users who only
> > think in terms of GROUP and JOIN and never use COGROUP explicitly.
> >
> > One could also conceive of combinations of these.  For example, we
> > always assign a name like grpkey to the group key(s), and in the single
> > key case we also carry forward the alias that the field already had, if
> > any.
> >
> > Thoughts?  Other possibilities?
> >
> > Alan.

