Re: Issues with group as an alias

Chris Olston Thu, 05 Jun 2008 11:13:32 -0700

I vote for (III) -- propagate the alias. This makes the scripts very natural and readable, e.g.:


a = group pages by host;
b = foreach a generate host, count(pages);

As for what to do in the case of grouping on multiple fields, or co-group on differently-named fields, we should *not* assign a default name -- the user can choose a name using "AS".
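
A minimal sketch of the naming rule proposed above, written in Python rather than Pig Latin because the thread does not settle on any syntax. The function name `group_key_alias` and its parameters are hypothetical and are not part of Pig. It also assumes that a co-group whose single keys share the same name propagates that name, which the message implies but does not state.

```python
from typing import Optional, Sequence


def group_key_alias(
    keys_per_relation: Sequence[Sequence[Optional[str]]],
    user_alias: Optional[str] = None,
) -> Optional[str]:
    """Return the alias of the group key, or None if it stays positional ($0).

    keys_per_relation holds, for each (co)grouped relation, the aliases of
    the fields it is grouped on (None for a field that has no alias).
    """
    # An explicit AS always wins.
    if user_alias is not None:
        return user_alias
    # Multi-field keys get no default name.
    if any(len(keys) != 1 for keys in keys_per_relation):
        return None
    # Single-field keys: propagate only if every relation agrees on the name.
    names = {keys[0] for keys in keys_per_relation}
    if len(names) == 1:
        (name,) = names
        return name  # still None if the field itself had no alias
    return None


print(group_key_alias([["host"]]))                      # host
print(group_key_alias([["x", "y"]]))                    # None
print(group_key_alias([["x"], ["t"]]))                  # None
print(group_key_alias([["x"], ["t"]], user_alias="k"))  # k
```

Under this model, the multi-key and differently-named co-group cases simply fall back to positional access until the user supplies a name.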


-Chris


On Jun 5, 2008, at 9:10 AM, Alan Gates wrote:

Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field (or set of fields) that are grouped on are given the alias 'group'. This has a couple of issues:
1)  It's confusing.  'group' is now a keyword and an alias.
2) We don't currently allow 'group' as an alias in an AS. It is strange to have an alias that can only be assigned by the language and never by the user.
Possible solutions:
I) Status quo. We could fix it so that group is allowed to be assigned as an alias in AS.
Pros:  Backward compatibility
Cons: a) will make the parser more complicated
     b) see 1) above.
II) Don't give an implicit alias to the group key(s). If users want an alias, they can assign it using AS.
Pros:  Simplicity
Cons: We do assign aliases to grouped bags. That is, if we have C = GROUP B by $0 the resulting schema of C is (group, B). So if we don't assign an alias to the group key, we now have a schema ($0, B). This seems strange. And worse yet, if users want to alias the group key(s), they'll be forced to alias all the grouped bags as well.
III) Carry the alias (if any) that the field had before. So if we had a script like:
A = load 'myfile' as (x, y, z);
B = group A by x;
Then the schema of B would be (x, A). This is quite natural for grouping of single columns. But it turns nasty when you group on multiple columns. Do we then append the names together? So if you have
B = group A by x, y;

is the resulting schema (x_y, A)?  Ugh.
In this case there is also the question of what to do in the case of cogroups, where the key may be named differently in different relations.
A = load 'myfile' as (x, y, z);
B = load 'myotherfile' as (t, u, v);
C = cogroup A by x, B by t;
Is the resulting schema (x, A, B) or (t, A, B) or are both valid? This could be resolved by either saying first one always wins, or allowing either.
Pros: Very natural for the users, their fields maintain names through the query.
Cons:  Quickly gets burdensome in the case of multi-key groups.
IV) Assign a non-keyword alias to the group key, like grp or groupkey or grpkey (or some other suitable choice).
Pros: Least disruptive change. Users only have to go through their scripts and find places where they use the group alias and change it to grp (or whatever).
Cons: Still leaves us with a situation where we are assigning a name to a field arbitrarily, leaving users confused as to how their fields got named that.
V) Remove GROUP as a keyword. It is just short for COGROUP of one relation anyway.
Pros:  Smaller syntax in a language is always good.
Cons: Will break a lot of scripts, and confuse a lot of users who only think in terms of GROUP and JOIN and never use COGROUP explicitly.
One could also conceive of combinations of these. For example, we always assign a name like grpkey to the group key(s), and in the single key case we also carry forward the alias that the field already had, if any.
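
The combination described above can be sketched as a small model, in Python rather than Pig Latin since no syntax has been agreed. The function name `combined_key_aliases` is hypothetical, and `grpkey` is only one of the candidate names floated in option IV, not a decided choice.

```python
from typing import List, Optional, Sequence


def combined_key_aliases(
    keys: Sequence[Optional[str]], default: str = "grpkey"
) -> List[str]:
    """Return every alias under which the group key could be referenced.

    keys holds the aliases of the fields grouped on (None if a field has
    no alias). The default alias is always present; a single named key
    additionally keeps its original alias.
    """
    aliases = [default]
    if len(keys) == 1 and keys[0] is not None and keys[0] != default:
        aliases.append(keys[0])
    return aliases


print(combined_key_aliases(["x"]))       # ['grpkey', 'x']
print(combined_key_aliases(["x", "y"]))  # ['grpkey']
print(combined_key_aliases([None]))      # ['grpkey']
```

In this model a script can always rely on the default name, while single-key groups also keep the more readable original alias.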
Thoughts?  Other possibilities?

Alan.


--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
