(or set of fields) that are grouped on are given the alias 'group'.
This has a couple of issues:
1) It's confusing. 'group' is now both a keyword and an alias.
2) We don't currently allow 'group' as an alias in an AS. It is strange
to have an alias that can only be assigned by the language and never by
the user.
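For concreteness, here is a small sketch of how the implicit alias shows up in a script today. The file name, field names and the COUNT call are only illustrative, and the commented-out line is my reading of issue 2), not a tested case.

```pig
A = LOAD 'myfile' AS (x, y, z);
B = GROUP A BY x;
-- Schema of B is (group, A): the key gets the implicit alias 'group'.
C = FOREACH B GENERATE group, COUNT(A);
-- Issue 2: the user cannot assign that same alias with AS, so
-- something like this is currently rejected:
-- D = FOREACH A GENERATE x AS group;
```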
Possible solutions:
I) Status quo. We could fix it so that group is allowed to be assigned
as an alias in AS.
Pros: Backward compatibility
Cons: a) will make the parser more complicated
      b) see 1) above.
II) Don't give an implicit alias to the group key(s). If users want an
alias, they can assign it using AS.
Pros: Simplicity
Cons: We do assign aliases to grouped bags. That is, if we have
C = GROUP B by $0, the resulting schema of C is (group, B). So if we
don't assign an alias to the group key, we now have a schema ($0, B).
This seems strange. And worse yet, if users want to alias the group
key(s), they'll be forced to alias all the grouped bags as well.
III) Carry the alias (if any) that the field had before. So if we had a
script like:
    A = load 'myfile' as (x, y, z);
    B = group A by x;
Then the schema of B would be (x, A). This is quite natural for grouping
of single columns. But it turns nasty when you group on multiple columns.
Do we then append the names together? So if you have
    B = group A by x, y;
is the resulting schema (x_y, A)? Ugh.
In this case there is also the question of what to do in the case of
cogroups, where the key may be named differently in different relations.
    A = load 'myfile' as (x, y, z);
    B = load 'myotherfile' as (t, u, v);
    C = cogroup A by x, B by t;
Is the resulting schema (x, A, B) or (t, A, B), or are both valid? This
could be resolved either by saying the first one always wins, or by
allowing either.
Pros: Very natural for the users, their fields maintain names through
the query.
Cons: Quickly gets burdensome in the case of multi-key groups.
IV) Assign a non-keyword alias to the group key, like grp or groupkey or
grpkey (or some other suitable choice).
Pros: Least disruptive change. Users only have to go through their
scripts and find places where they use the group alias and change it to
grp (or whatever).
Cons: Still leaves us with a situation where we are assigning a name to
a field arbitrarily, leaving users confused as to how their fields got
named that.
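To make the migration cost concrete, a sketch of what the change would look like in a user's script. This assumes grpkey is the name chosen; that is just one of the candidates above and nothing here is implemented.

```pig
A = LOAD 'myfile' AS (x, y, z);
B = GROUP A BY x;

-- Today: the key is referenced through the implicit alias 'group'.
C = FOREACH B GENERATE group, COUNT(A);

-- Hypothetical, under option IV with 'grpkey' as the chosen name:
-- schema of B would be (grpkey, A), so the line above becomes
-- C = FOREACH B GENERATE grpkey, COUNT(A);
```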
V) Remove GROUP as a keyword. It is just short for COGROUP of one
relation anyway.
Pros: Smaller syntax in a language is always good.
Cons: Will break a lot of scripts, and confuse a lot of users who only
think in terms of GROUP and JOIN and never use COGROUP explicitly.
One could also conceive of combinations of these. For example, we always
assign a name like grpkey to the group key(s), and in the single key
case we also carry forward the alias that the field already had, if any.
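A rough sketch of how that combination might look to a user. It assumes the name grpkey and assumes both names would resolve to the same field; neither is decided, and none of this runs today.

```pig
A = LOAD 'myfile' AS (x, y, z);

-- Single key: hypothetically both 'grpkey' and the carried-forward 'x'
-- would refer to the group key.
B = GROUP A BY x;
-- C = FOREACH B GENERATE grpkey, COUNT(A);
-- D = FOREACH B GENERATE x, COUNT(A);

-- Multiple keys: only 'grpkey' would be assigned, which avoids
-- inventing a name like x_y.
E = GROUP A BY (x, y);
-- F = FOREACH E GENERATE grpkey, COUNT(A);
```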
Thoughts? Other possibilities?
Alan.