Good idea.

On Jun 16, 2008, at 11:31 AM, Alan Gates wrote:

I would like to propose a slight modification:

I think that we should continue to support 'group' as the alias name for some transition period (3 or maybe 6 months). We can remove all references to group as an alias from the documentation and print a warning when users use it. But I don't think we should drop it immediately, as we'll break many scripts.

Other than that I'm fine with the proposal.

Alan.

Chris Olston wrote:
No.

The standing proposal for Option III is:

1. If you are (CO)Grouping on a *single* field AND in the case of co-group all field names are the same (e.g., cogroup A by url, B by url), then give the group key that name (e.g., "url").
2. Else, do *not* automatically assign any name. The user can refer to it as $0 and/or use "AS" to give it a name manually.

(To be clear, even in case #1, the user has the option to override the automatically-assigned name using "AS" if s/he chooses.)
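
As a rough illustration only, the rule above can be modelled in a few lines of Python. This is a sketch of the naming decision, not Pig's implementation; the function name group_key_alias and its parameters are hypothetical and were chosen just for this example.

```python
from typing import Optional, Sequence


def group_key_alias(
    keys_per_relation: Sequence[Sequence[Optional[str]]],
    user_alias: Optional[str] = None,
) -> Optional[str]:
    """Return the alias for the group key under Option III, or None.

    keys_per_relation holds one entry per (co)grouped relation; each entry
    lists the names of the fields that relation is grouped on (None for a
    field that has no name). A result of None means no name is assigned
    automatically, so the key can only be referenced positionally as $0.
    """
    # An explicit "AS" always overrides the automatic choice.
    if user_alias is not None:
        return user_alias

    # Case 1 applies only when every relation is grouped on a single field.
    if any(len(keys) != 1 for keys in keys_per_relation):
        return None

    # ...and only when that field has the same name in every relation.
    names = {keys[0] for keys in keys_per_relation}
    if len(names) == 1:
        (name,) = names
        return name

    # Case 2: differing names, so nothing is assigned automatically.
    return None
```

With this model, "cogroup A by url, B by url" yields "url", "cogroup A by x, B by t" yields no automatic name, and any multi-field key such as "COGROUP A by (B,C), B by (A, C)" also yields no automatic name.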

-Chris


On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:

I completely agree. It does start getting confusing, especially if we try to
deal with multi-field keys.

A = load 'somefile1' USING PigStorage() AS (B, C, Z);
B = load 'somefile2' USING PigStorage() AS (A, C, Y);
C = load 'somefile3' USING PigStorage() AS (A, B);

G1 = COGROUP A by (B,C), B by (A, C);
G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);

What is the schema for G2?

ben

On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
So what is the conclusion here?

group key alias == the first variable's group-by field?


What happens in a case like this, then:

--
A = load 'somefile1' USING PigStorage() AS (B, C);
B = load 'somefile2' USING PigStorage() AS (A, C);
C = load 'somefile3' USING PigStorage() AS (A, B);

G1 = COGROUP A by B, B by A;
G2 = COGROUP A by C, C by A;
...
--

A slightly contrived example for sure, but imo the grammar has to be as
clearly specified as possible.

A reserved keyword as the group alias implies we don't hit this problem
(group or groupkey or grpkey)... and also that we remain
backward compatible.

[I never liked inferred schema prefix section in the schemas doc (which is applied selectively) - makes it extremely tough to generate pig scripts]


Regards,
Mridul

Alan Gates wrote:
Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field (or set of fields) that are grouped on are given the alias 'group'.
This has a couple of issues:

1)  It's confusing.  'group' is now a keyword and an alias.
2)  We don't currently allow 'group' as an alias in an AS.  It is
strange to have an alias that can only be assigned by the language and
never by the user.

Possible solutions:

I) Status quo. We could fix it so that group is allowed to be assigned
as an alias in AS.

Pros:  Backward compatibility
Cons: a) will make the parser more complicated
     b) see 1) above.


II) Don't give an implicit alias to the group key(s). If users want an
alias, they can assign it using AS.

Pros:  Simplicity
Cons: We do assign aliases to grouped bags. That is, if we have C = GROUP B by $0 the resulting schema of C is (group, B). So if we don't assign an alias to the group key, we now have a schema ($0, B). This seems strange. And worse yet, if users want to alias the group key(s),
they'll be forced to alias all the grouped bags as well.

III) Carry the alias (if any) that the field had before. So if we had a
script like:

A = load 'myfile' as (x, y, z);
B = group A by x;

Then the schema of B would be (x, A). This is quite natural for grouping
of single columns.  But it turns nasty when you group on multiple
columns.  Do we then append the names together?  So if you have

B = group A by (x, y);

is the resulting schema (x_y, A)?  Ugh.

In this case there is also the question of what to do in the case of cogroups, where the key may be named differently in different relations.

A = load 'myfile' as (x, y, z);
B = load 'myotherfile' as (t, u, v);
C = cogroup A by x, B by t;

Is the resulting schema (x, A, B) or (t, A, B) or are both valid? This could be resolved by either saying first one always wins, or allowing
either.

Pros: Very natural for the users, their fields maintain names through
the query.
Cons:  Quickly gets burdensome in the case of multi-key groups.

IV) Assign a non-keyword alias to the group key, like grp or groupkey or
grpkey (or some other suitable choice).
Pros: Least disruptive change. Users only have to go through their scripts and find places where they use the group alias and change it to
grp (or whatever).
Cons: Still leaves us with a situation where we are assigning a name to a field arbitrarily, leaving users confused as to how their fields got
named that.

V) Remove GROUP as a keyword.  It is just short for COGROUP of one
relation anyway.

Pros:  Smaller syntax in a language is always good.
Cons: Will break a lot of scripts, and confuse a lot of users who only
think in terms of GROUP and JOIN and never use COGROUP explicitly.

One could also conceive of combinations of these.  For example, we
always assign a name like grpkey to the group key(s), and in the single key case we also carry forward the alias that the field already had, if
any.

Thoughts?  Other possibilities?

Alan.



--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research



