I think it is not totally orthogonal as the answer to my question can be used to eliminate option (I).
I agree on option (III) for usability reasons.

Pi

On 6/10/08, Chris Olston <[EMAIL PROTECTED]> wrote:
>
> The issue of non-reserved keywords is orthogonal to the issue at hand: what
> is the most natural way to name the group key (i.e., we can still allow
> non-reserved keywords, and select a different way of naming the group key,
> if we want).
>
> Every time I give a tutorial on Pig, people struggle to understand what
> this mysterious "group" field is. It is ugly and non-intuitive.
>
> Option III is far more natural, and will cover 95% of the cases (for the
> rest of the cases, the user is doing something complicated so I think it's
> okay for them to name the group key manually).
>
> -Chris
>
>
> On Jun 9, 2008, at 3:39 AM, pi song wrote:
>
>> I prefer (I) and that means I want to allow non-reserved keywords.
>>
>> On Fri, Jun 6, 2008 at 9:33 AM, pi song <[EMAIL PROTECTED]> wrote:
>>
>>> I know it is very subjective to say I don't agree with "1) It's
>>> confusing". On developers' side, it is. But on users' side, it might
>>> not be.
>>>
>>> Some languages allow usage of keywords given they are used in the right
>>> context. The current Pig implementation also allows referring to "group"
>>> as an alias.
>>>
>>> Before we jump to the solution, shouldn't it be better to make our
>>> position clear on "Do we want every keyword to be a reserved word
>>> regardless of context?"
>>>
>>> Pi
>>>
>>>
>>> On 6/6/08, Chris Olston <[EMAIL PROTECTED]> wrote:
>>>
>>>> I vote for (III) -- propagate the alias. This makes the scripts very
>>>> natural and readable, e.g.:
>>>>
>>>>   a = group pages by host;
>>>>   b = foreach a generate host, count(pages);
>>>>
>>>> As for what to do in the case of grouping on multiple fields, or
>>>> co-group on differently-named fields, we should *not* assign a default
>>>> name -- the user can choose a name using "AS".
>>>>
>>>> -Chris
>>>>
>>>>
>>>> On Jun 5, 2008, at 9:10 AM, Alan Gates wrote:
>>>>
>>>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the
>>>>> field (or set of fields) that are grouped on are given the alias
>>>>> 'group'. This has a couple of issues:
>>>>>
>>>>> 1) It's confusing. 'group' is now a keyword and an alias.
>>>>> 2) We don't currently allow 'group' as an alias in an AS. It is
>>>>>    strange to have an alias that can only be assigned by the language
>>>>>    and never by the user.
>>>>>
>>>>> Possible solutions:
>>>>>
>>>>> I) Status quo. We could fix it so that group is allowed to be
>>>>> assigned as an alias in AS.
>>>>>
>>>>> Pros: Backward compatibility
>>>>> Cons: a) will make the parser more complicated
>>>>>       b) see 1) above.
>>>>>
>>>>>
>>>>> II) Don't give an implicit alias to the group key(s). If users want
>>>>> an alias, they can assign it using AS.
>>>>>
>>>>> Pros: Simplicity
>>>>> Cons: We do assign aliases to grouped bags. That is, if we have
>>>>> C = GROUP B by $0 the resulting schema of C is (group, B). So if we
>>>>> don't assign an alias to the group key, we now have a schema ($0, B).
>>>>> This seems strange. And worse yet, if users want to alias the group
>>>>> key(s), they'll be forced to alias all the grouped bags as well.
>>>>>
>>>>> III) Carry the alias (if any) that the field had before. So if we
>>>>> had a script like:
>>>>>
>>>>>   A = load 'myfile' as (x, y, z);
>>>>>   B = group A by x;
>>>>>
>>>>> Then the schema of B would be (x, A). This is quite natural for
>>>>> grouping of single columns. But it turns nasty when you group on
>>>>> multiple columns. Do we then append the names together? So if you
>>>>> have
>>>>>
>>>>>   B = group A by x, y;
>>>>>
>>>>> is the resulting schema (x_y, A)? Ugh.
>>>>>
>>>>> In this case there is also the question of what to do in the case of
>>>>> cogroups, where the key may be named differently in different
>>>>> relations.
>>>>>
>>>>>   A = load 'myfile' as (x, y, z);
>>>>>   B = load 'myotherfile' as (t, u, v);
>>>>>   C = cogroup A by x, B by t;
>>>>>
>>>>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?
>>>>> This could be resolved by either saying first one always wins, or
>>>>> allowing either.
>>>>>
>>>>> Pros: Very natural for the users, their fields maintain names through
>>>>> the query.
>>>>> Cons: Quickly gets burdensome in the case of multi-key groups.
>>>>>
>>>>> IV) Assign a non-keyword alias to the group key, like grp or groupkey
>>>>> or grpkey (or some other suitable choice).
>>>>>
>>>>> Pros: Least disruptive change. Users only have to go through their
>>>>> scripts and find places where they use the group alias and change it
>>>>> to grp (or whatever).
>>>>> Cons: Still leaves us with a situation where we are assigning a name
>>>>> to a field arbitrarily, leaving users confused as to how their fields
>>>>> got named that.
>>>>>
>>>>> V) Remove GROUP as a keyword. It is just short for COGROUP of one
>>>>> relation anyway.
>>>>>
>>>>> Pros: Smaller syntax in a language is always good.
>>>>> Cons: Will break a lot of scripts, and confuse a lot of users who
>>>>> only think in terms of GROUP and JOIN and never use COGROUP
>>>>> explicitly.
>>>>>
>>>>> One could also conceive of combinations of these. For example, we
>>>>> always assign a name like grpkey to the group key(s), and in the
>>>>> single key case we also carry forward the alias that the field
>>>>> already had, if any.
>>>>>
>>>>> Thoughts? Other possibilities?
>>>>>
>>>>> Alan.
>>>>>
>>>>
>>>> --
>>>> Christopher Olston, Ph.D.
>>>> Sr. Research Scientist
>>>> Yahoo! Research
>>>>
>>>
>
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
>
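
To compare the naming options discussed in this thread side by side, here is a minimal Python sketch of how the group key might be named under options (I) through (IV). It is only an illustrative model, not Pig's implementation: the function names `group_key_alias` and `result_schema` are hypothetical, `grpkey` is just one of the names suggested for option (IV), and option (III) is modelled with the refinement proposed above (no default name for multi-field keys or for co-groups on differently-named fields).

```python
from typing import List, Optional, Sequence


def group_key_alias(option: str,
                    keys_per_relation: Sequence[Sequence[Optional[str]]]) -> Optional[str]:
    """Return the alias given to the group key, or None if it stays unnamed.

    keys_per_relation holds, for each (co)grouped relation, the aliases of
    the fields it is grouped on (None for a positional field such as $0).
    """
    if option == "I":    # status quo: the key is always called 'group'
        return "group"
    if option == "II":   # no implicit alias at all
        return None
    if option == "IV":   # fixed non-keyword alias (hypothetical choice)
        return "grpkey"
    if option == "III":  # carry the existing alias when it is unambiguous
        first = list(keys_per_relation[0])
        if len(first) != 1 or first[0] is None:
            return None  # multi-field or unnamed key: user must use AS
        if any(list(keys) != first for keys in keys_per_relation):
            return None  # co-group on differently-named fields
        return first[0]
    raise ValueError(f"unknown option: {option}")


def result_schema(option: str,
                  relations: Sequence[str],
                  keys_per_relation: Sequence[Sequence[Optional[str]]]) -> List[str]:
    """Schema of the (co)group result: the key, then one bag per relation."""
    alias = group_key_alias(option, keys_per_relation)
    # An unnamed key can only be referred to by position.
    return [alias if alias is not None else "$0"] + list(relations)


if __name__ == "__main__":
    # B = group A by x;
    for opt in ("I", "II", "III", "IV"):
        print(opt, result_schema(opt, ["A"], [["x"]]))
    # C = cogroup A by x, B by t;
    print("III", result_schema("III", ["A", "B"], [["x"], ["t"]]))
```

Under this model, single-key grouping with option (III) keeps the user's own field name, while the ambiguous cases fall back to a positional reference until the user names the key with AS.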
