I think that I am convinced III is best.

On Fri, Jun 13, 2008 at 7:26 AM, Alan Gates <[EMAIL PROTECTED]> wrote:
> All,
>
> I too will vote for III, with the caveat that we don't give names to
> multi-field grouping keys. We need to make sure we support AS to allow the
> user to name their grouping keys if they want.
>
> So far, the vote totals are:
> I: 1
> II: 0
> III: 3
> IV: 0
> V: 0
>
> I'd like to make a decision and move forward by mid next week. If you
> haven't voted and you'd like to, please do so now. If you feel passionately
> about one of the options that is losing, please make your arguments now.
>
> Alan.
>
> Alan Gates wrote:
>
>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field
>> (or set of fields) that are grouped on are given the alias 'group'. This
>> has a couple of issues:
>>
>> 1) It's confusing. 'group' is now a keyword and an alias.
>> 2) We don't currently allow 'group' as an alias in an AS. It is strange
>> to have an alias that can only be assigned by the language and never by the
>> user.
>>
>> Possible solutions:
>>
>> I) Status quo. We could fix it so that group is allowed to be assigned as
>> an alias in AS.
>>
>> Pros: Backward compatibility
>> Cons: a) will make the parser more complicated
>>       b) see 1) above.
>>
>>
>> II) Don't give an implicit alias to the group key(s). If users want an
>> alias, they can assign it using AS.
>>
>> Pros: Simplicity
>> Cons: We do assign aliases to grouped bags. That is, if we have C =
>> GROUP B by $0 the resulting schema of C is (group, B). So if we don't
>> assign an alias to the group key, we now have a schema ($0, B). This seems
>> strange. And worse yet, if users want to alias the group key(s), they'll be
>> forced to alias all the grouped bags as well.
>>
>> III) Carry the alias (if any) that the field had before. So if we had a
>> script like:
>>
>>     A = load 'myfile' as (x, y, z);
>>     B = group A by x;
>>
>> Then the schema of B would be (x, A). This is quite natural for grouping
>> of single columns. But it turns nasty when you group on multiple columns.
>> Do we then append the names together? So if you have
>>
>>     B = group A by x, y;
>>
>> is the resulting schema (x_y, A)? Ugh.
>>
>> In this case there is also the question of what to do in the case of
>> cogroups, where the key may be named differently in different relations.
>>
>>     A = load 'myfile' as (x, y, z);
>>     B = load 'myotherfile' as (t, u, v);
>>     C = cogroup A by x, B by t;
>>
>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid? This
>> could be resolved by either saying first one always wins, or allowing
>> either.
>>
>> Pros: Very natural for the users, their fields maintain names through the
>> query.
>> Cons: Quickly gets burdensome in the case of multi-key groups.
>>
>> IV) Assign a non-keyword alias to the group key, like grp or groupkey or
>> grpkey (or some other suitable choice).
>>
>> Pros: Least disruptive change. Users only have to go through their
>> scripts and find places where they use the group alias and change it to grp
>> (or whatever).
>> Cons: Still leaves us with a situation where we are assigning a name to a
>> field arbitrarily, leaving users confused as to how their fields got named
>> that.
>>
>> V) Remove GROUP as a keyword. It is just short for COGROUP of one
>> relation anyway.
>>
>> Pros: Smaller syntax in a language is always good.
>> Cons: Will break a lot of scripts, and confuse a lot of users who only
>> think in terms of GROUP and JOIN and never use COGROUP explicitly.
>>
>> One could also conceive of combinations of these. For example, we always
>> assign a name like grpkey to the group key(s), and in the single key case we
>> also carry forward the alias that the field already had, if any.
>>
>> Thoughts? Other possibilities?
>>
>> Alan.
>>
>
-- ted
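
For illustration only, the naming rule under option III, combined with Alan's caveat about multi-field keys, can be sketched in Python. This is a hypothetical model, not Pig's implementation: the function names `group_key_alias` and `grouped_schema` are made up here, the `$0` fallback borrows the positional notation used in option II, and "first relation wins" for cogroups is just one of the two resolutions floated in the thread.

```python
from typing import List, Optional, Sequence


def group_key_alias(
    keys_per_relation: Sequence[Sequence[Optional[str]]],
    as_alias: Optional[str] = None,
) -> Optional[str]:
    """Return the alias of the group key, or None if it stays unnamed."""
    # An explicit AS from the user always takes precedence.
    if as_alias is not None:
        return as_alias
    first = keys_per_relation[0]
    # Caveat from the thread: multi-field keys get no implicit name.
    if len(first) != 1:
        return None
    # Single key: carry forward the alias from the first relation
    # (assumption: "first one always wins" for cogroups).
    return first[0]


def grouped_schema(
    relations: Sequence[str],
    keys_per_relation: Sequence[Sequence[Optional[str]]],
    as_alias: Optional[str] = None,
) -> List[str]:
    """Build the output schema: the key column followed by one bag per relation."""
    alias = group_key_alias(keys_per_relation, as_alias)
    # An unnamed key is shown positionally as $0.
    return [alias if alias is not None else "$0"] + list(relations)


if __name__ == "__main__":
    print(grouped_schema(["A"], [["x"]]))              # B = group A by x;
    print(grouped_schema(["A"], [["x", "y"]]))         # B = group A by x, y;
    print(grouped_schema(["A", "B"], [["x"], ["t"]]))  # C = cogroup A by x, B by t;
```

Under these assumptions the single-key group keeps the name `x`, the two-key group stays unnamed unless the user supplies an alias, and the cogroup takes its key name from `A`.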
