I think it is not totally orthogonal, as the answer to my question can be
used to eliminate option (I).

I agree with option (III) for usability reasons.

Pi


On 6/10/08, Chris Olston <[EMAIL PROTECTED]> wrote:
>
> The issue of non-reserved keywords is orthogonal to the issue at hand: what
> is the most natural way to name the group key (i.e., we can still allow
> non-reserved keywords, and select a different way of naming the group key,
> if we want).
>
> Every time I give a tutorial on Pig, people struggle to understand what
> this mysterious "group" field is. It is ugly and non-intuitive.
>
> Option III is far more natural, and will cover 95% of the cases (for the
> rest of the cases, the user is doing something complicated so I think it's
> okay for them to name the group key manually).
>
> -Chris
>
>
> On Jun 9, 2008, at 3:39 AM, pi song wrote:
>
>> I prefer (I), and that means I want to allow non-reserved keywords.
>>
>> On Fri, Jun 6, 2008 at 9:33 AM, pi song <[EMAIL PROTECTED]> wrote:
>>
>>> I know it is very subjective to say I don't agree with "1)  It's
>>> confusing". From the developers' side it is, but from the users' side it
>>> might not be.
>>>
>>> Some languages allow keywords to be used as names, provided they appear
>>> in the right context. The current Pig implementation also allows
>>> referring to "group" as an alias.
>>>
>>> Before we jump to a solution, shouldn't we first make our position
>>> clear on "Do we want every keyword to be a reserved word regardless of
>>> context?"
>>>
>>> Pi
>>>
>>>
>>> On 6/6/08, Chris Olston <[EMAIL PROTECTED]> wrote:
>>>
>>>>
>>>> I vote for (III) -- propagate the alias. This makes the scripts very
>>>> natural and readable, e.g.:
>>>>
>>>> a = group pages by host;
>>>> b = foreach a generate host, count(pages);
>>>>
>>>> As for what to do in the case of grouping on multiple fields, or
>>>> co-group on differently-named fields, we should *not* assign a default
>>>> name -- the user can choose a name using "AS".
>>>>
>>>> -Chris
>>>>
>>>>
>>>> On Jun 5, 2008, at 9:10 AM, Alan Gates wrote:
>>>>
>>>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field
>>>>> (or set of fields) that is grouped on is given the alias 'group'.  This
>>>>> has a couple of issues:
>>>>>
>>>>> 1)  It's confusing.  'group' is now a keyword and an alias.
>>>>> 2)  We don't currently allow 'group' as an alias in an AS.  It is
>>>>> strange to have an alias that can only be assigned by the language and
>>>>> never by the user.
>>>>>
>>>>> Possible solutions:
>>>>>
>>>>> I) Status quo.  We could fix it so that group is allowed to be assigned
>>>>> as an alias in AS.
>>>>>
>>>>> Pros:  Backward compatibility
>>>>> Cons: a) will make the parser more complicated
>>>>>    b) see 1) above.
>>>>>
>>>>>
>>>>> II) Don't give an implicit alias to the group key(s).  If users want an
>>>>> alias, they can assign it using AS.
>>>>>
>>>>> Pros:  Simplicity
>>>>> Cons:  We do assign aliases to grouped bags.  That is, if we have C =
>>>>> GROUP B by $0 the resulting schema of C is (group, B).  So if we don't
>>>>> assign an alias to the group key, we now have a schema ($0, B).  This
>>>>> seems strange.  And worse yet, if users want to alias the group key(s),
>>>>> they'll be forced to alias all the grouped bags as well.
>>>>>
>>>>> III) Carry the alias (if any) that the field had before.  So if we had
>>>>> a script like:
>>>>>
>>>>> A = load 'myfile' as (x, y, z);
>>>>> B = group A by x;
>>>>>
>>>>> Then the schema of B would be (x, A).  This is quite natural for
>>>>> grouping of single columns.  But it turns nasty when you group on
>>>>> multiple columns.  Do we then append the names together?  So if you
>>>>> have
>>>>>
>>>>> B = group A by x, y;
>>>>>
>>>>> is the resulting schema (x_y, A)?  Ugh.
>>>>>
>>>>> In this case there is also the question of what to do in the case of
>>>>> cogroups, where the key may be named differently in different
>>>>> relations.
>>>>>
>>>>> A = load 'myfile' as (x, y, z);
>>>>> B = load 'myotherfile' as (t, u, v);
>>>>> C = cogroup A by x, B by t;
>>>>>
>>>>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?  This
>>>>> could be resolved by either saying the first one always wins, or
>>>>> allowing either.
>>>>>
>>>>> Pros:  Very natural for the users, their fields maintain names through
>>>>> the query.
>>>>> Cons:  Quickly gets burdensome in the case of multi-key groups.
>>>>>
>>>>> IV) Assign a non-keyword alias to the group key, like grp or groupkey
>>>>> or grpkey (or some other suitable choice).
>>>>> Pros:  Least disruptive change.  Users only have to go through their
>>>>> scripts and find places where they use the group alias and change it to
>>>>> grp (or whatever).
>>>>> Cons:  Still leaves us with a situation where we are assigning a name
>>>>> to a field arbitrarily, leaving users confused as to how their fields
>>>>> got named that.
>>>>>
>>>>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
>>>>> relation anyway.
>>>>>
>>>>> Pros:  Smaller syntax in a language is always good.
>>>>> Cons:  Will break a lot of scripts, and confuse a lot of users who only
>>>>> think in terms of GROUP and JOIN and never use COGROUP explicitly.
>>>>>
>>>>> One could also conceive of combinations of these.  For example, we
>>>>> always assign a name like grpkey to the group key(s), and in the single
>>>>> key case we also carry forward the alias that the field already had, if
>>>>> any.
>>>>>
>>>>> Thoughts?  Other possibilities?
>>>>>
>>>>> Alan.
>>>>>
>>>>>
>>>> --
>>>> Christopher Olston, Ph.D.
>>>> Sr. Research Scientist
>>>> Yahoo! Research
>>>>
>>>>
>>>>
>>>>
>>>
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
>
>
>

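For anyone comparing the proposals side by side, the naming rules discussed in this thread can be roughly sketched in Python (used here only to model the rules; this is not Pig code or Pig's implementation). The function names `group_key_alias` and `grouped_schema` are hypothetical. Two assumptions go beyond what the thread states: an unnamed key is shown as `$0`, and for option III a cogroup carries the alias only when every input groups on a single key with the same name.

```python
from typing import List, Optional, Tuple

# One input to a (CO)GROUP: (relation name, aliases of its key fields).
# An alias of None stands for a positional key such as $0.
Input = Tuple[str, List[Optional[str]]]


def group_key_alias(option: str, inputs: List[Input]) -> Optional[str]:
    """Return the implicit alias of the group key, or None if there is none."""
    if option == "I":    # status quo: the key is always called 'group'
        return "group"
    if option == "II":   # never assign an implicit alias
        return None
    if option == "IV":   # fixed non-keyword alias ('grpkey' is one suggestion)
        return "grpkey"
    if option == "III":  # carry the field's own alias when it is unambiguous
        first_keys = inputs[0][1]
        # Multi-key or positional grouping: no default name, user must use AS.
        if len(first_keys) != 1 or first_keys[0] is None:
            return None
        # Assumption: a cogroup carries the alias only if all inputs agree.
        if all(keys == first_keys for _, keys in inputs):
            return first_keys[0]
        return None
    raise ValueError(f"unknown option: {option}")


def grouped_schema(option: str, inputs: List[Input]) -> Tuple[str, ...]:
    """Schema of the grouped relation: key name followed by one bag per input."""
    alias = group_key_alias(option, inputs)
    key_name = alias if alias is not None else "$0"
    return (key_name,) + tuple(name for name, _ in inputs)


# B = group A by x;
print(grouped_schema("I", [("A", ["x"])]))
print(grouped_schema("III", [("A", ["x"])]))
# C = cogroup A by x, B by t;
print(grouped_schema("III", [("A", ["x"]), ("B", ["t"])]))
```

Under this sketch, option III yields (x, A) for the single-key example and falls back to an unnamed key for multi-key groups and for cogroups on differently named fields, which is where the user would supply a name with AS.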