Currently in Pig, aliases are generally only assigned by the user.
There is one exception to this rule, which is (co)group. Consider a
script like:
a = load 'myfile';
b = load 'anotherfile';
c = cogroup a by $0, b by $0;
The relation c will have fields with the aliases group, a, and b,
without the user having assigned those names.
There are a couple of problems with this. First, we've had a number of
users complain that this is confusing. a and b are suddenly overloaded
terms in the script. Consider, for example, that both of the following
lines are possible and refer to entirely different meanings for 'a':
d = filter a by $0 eq 'fred';
d = foreach c generate count(a);
In the first line, 'a' refers to the relation produced by the load. In
the second, it refers to the bag that is the second field ($1) of the
relation 'c'. The same holds for 'group' which is now both a keyword
and an alias (yuck!).
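To make the overloading concrete, here is a small sketch that extends
the script above. It assumes the current automatic aliasing, and the
aliases d and e are only illustrative. Both lines count the same bag;
only the first has to reuse the name 'a':

```pig
-- 'a' here is the bag field of c, not the relation loaded from 'myfile'
d = foreach c generate group, count(a);
-- same bag, referenced by position, so the name 'a' is not reused
e = foreach c generate $0, count($1);
```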
The second issue is that this is generally inconsistent. Everywhere
else Pig Latin leaves it to users to define aliases, but here it
assigns them automatically.
So the proposal is to remove this automatic aliasing from cogroup.
Cogroup would support AS, so that users could define aliases for these
bags if they desired. This may be a little awkward, as the AS list is
positional: users need to remember to provide an alias for the group
key before aliasing the bags.
For example, taking the script above:
c = cogroup a by $0, b by $0 as name, file1, file2;
So name would now be the alias for the group key (formerly aliased as
'group'), file1 for the first bag (formerly 'a') and file2 for the
second bag (formerly 'b').
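As a rough sketch of how a full script might read under this proposal
(the AS clause on cogroup is the proposed syntax, not something Pig
accepts today, and the alias d is only illustrative):

```pig
a = load 'myfile';
b = load 'anotherfile';
c = cogroup a by $0, b by $0 as name, file1, file2;
-- 'a' and 'b' now only ever mean the loaded relations
d = foreach c generate name, count(file1), count(file2);
```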
Everything said here applies to group as well as cogroup.
Obviously this change isn't backward compatible.
Thoughts?
Alan.