Re: Issues with group as an alias

Chris Olston Mon, 09 Jun 2008 08:14:46 -0700

The issue of non-reserved keywords is orthogonal to the issue at hand: what is the most natural way to name the group key (i.e., we can still allow non-reserved keywords, and select a different way of naming the group key, if we want).

Every time I give a tutorial on Pig, people struggle to understand what this mysterious "group" field is. It is ugly and non-intuitive.

Option III is far more natural, and will cover 95% of the cases (for the rest of the cases, the user is doing something complicated so I think it's okay for them to name the group key manually).


-Chris


On Jun 9, 2008, at 3:39 AM, pi song wrote:

I prefer (I) and that means I want to allow non-reserved keywords.

On Fri, Jun 6, 2008 at 9:33 AM, pi song <[EMAIL PROTECTED]> wrote:
I know it is very subjective to say I don't agree with "1)  It's
confusing". On the developers' side, it is. But on the users' side, it might not be.
Some languages allow the use of keywords as identifiers, provided they are used in the right context. The current Pig implementation also allows referring to "group" as
an alias.

Before we jump to the solution, shouldn't we first make our
position clear on "Do we want every keyword to be a reserved word regardless
of context?"

Pi


On 6/6/08, Chris Olston <[EMAIL PROTECTED]> wrote:
I vote for (III) -- propagate the alias. This makes the scripts very
natural and readable, e.g.:

a = group pages by host;
b = foreach a generate host, count(pages);

As for what to do in the case of grouping on multiple fields, or co-grouping on differently-named fields, we should *not* assign a default name -- the
user can choose a name using "AS".

-Chris


On Jun 5, 2008, at 9:10 AM, Alan Gates wrote:
Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field
(or set of fields) that are grouped on are given the alias 'group'. This
has a couple of issues:

1)  It's confusing.  'group' is now a keyword and an alias.
2) We don't currently allow 'group' as an alias in an AS. It is strange to have an alias that can only be assigned by the language and never by the
user.

Possible solutions:
I) Status quo. We could fix it so that group is allowed to be assigned
as an alias in AS.

Pros:  Backward compatibility
Cons: a) will make the parser more complicated
    b) see 1) above.
II) Don't give an implicit alias to the group key(s). If users want an
alias, they can assign it using AS.

Pros:  Simplicity
Cons: We do assign aliases to grouped bags. That is, if we have C = GROUP B by $0 the resulting schema of C is (group, B). So if we don't assign an alias to the group key, we now have a schema ($0, B). This seems strange. And worse yet, if users want to alias the group key(s), they'll be
forced to alias all the grouped bags as well.
III) Carry the alias (if any) that the field had before. So if we had a
script like:

A = load 'myfile' as (x, y, z);
B = group A by x;
Then the schema of B would be (x, A). This is quite natural for grouping of single columns. But it turns nasty when you group on multiple columns.
 Do we then append the names together?  So if you have

B = group A by x, y;

is the resulting schema (x_y, A)?  Ugh.
In this case there is also the question of what to do in the case of cogroups, where the key may be named differently in different relations.
A = load 'myfile' as (x, y, z);
B = load 'myotherfile' as (t, u, v);
C = cogroup A by x, B by t;
Is the resulting schema (x, A, B) or (t, A, B) or are both valid? This could be resolved by either saying first one always wins, or allowing
either.
Pros: Very natural for the users, their fields maintain names through
the query.
Cons:  Quickly gets burdensome in the case of multi-key groups.
IV) Assign a non-keyword alias to the group key, like grp or groupkey or
grpkey (or some other suitable choice).
Pros: Least disruptive change. Users only have to go through their scripts and find places where they use the group alias and change it to grp
(or whatever).
Cons: Still leaves us with a situation where we are assigning a name to a field arbitrarily, leaving users confused as to how their fields got named
that.

V) Remove GROUP as a keyword.  It is just short for COGROUP of one
relation anyway.

Pros:  Smaller syntax in a language is always good.
Cons: Will break a lot of scripts, and confuse a lot of users who only
think in terms of GROUP and JOIN and never use COGROUP explicitly.

One could also conceive of combinations of these. For example, we always assign a name like grpkey to the group key(s), and in the single key case we
also carry forward the alias that the field already had, if any.
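
To compare the options side by side, the key name each one would produce can be modeled in a few lines of Python. This is only an illustrative sketch, not Pig code: the function name group_schema and its arguments are hypothetical, "grpkey" is just one of the candidate names for (IV), and (III) is modeled under the assumption that no default name is assigned for multi-key groups or for cogroups on differently named keys.

```python
# Illustrative model only; this is NOT how Pig computes schemas.
# All names here (group_schema, the option labels) are hypothetical.

def group_schema(option, relations, keys):
    """Return the output schema of a (CO)GROUP as a list of names.

    option    -- "I", "II", "III" or "IV", matching the proposals above
    relations -- aliases of the grouped relations, e.g. ["A"] or ["A", "B"]
    keys      -- for each relation, the list of key aliases it is grouped on
    """
    if option == "I":
        key_name = "group"        # status quo: implicit alias 'group'
    elif option == "II":
        key_name = "$0"           # no implicit alias, positional access only
    elif option == "III":
        first = keys[0]
        # Assumption: carry the alias forward only when every relation is
        # grouped on a single key with the same name; otherwise leave the
        # key unnamed so the user can name it with AS.
        if len(first) == 1 and all(k == first for k in keys):
            key_name = first[0]
        else:
            key_name = "$0"
    elif option == "IV":
        key_name = "grpkey"       # fixed non-keyword alias
    else:
        raise ValueError("unknown option: %r" % (option,))
    return [key_name] + list(relations)


# B = group A by x;
print(group_schema("I", ["A"], [["x"]]))
print(group_schema("III", ["A"], [["x"]]))
# B = group A by x, y;
print(group_schema("III", ["A"], [["x", "y"]]))
# C = cogroup A by x, B by t;
print(group_schema("III", ["A", "B"], [["x"], ["t"]]))
```

Under these assumptions only the single-key case changes visibly between (I) and (III); the multi-key and cogroup cases fall back to a positional key until the user names it.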

Thoughts?  Other possibilities?

Alan.
--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research


