[
https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928900#action_12928900
]
Scott Carey commented on PIG-1693:
----------------------------------
If this doesn't work with named aliases, it's almost useless for me. Numbered
references are not maintainable: what happens when you add a column to a
complex flow, or remove one? Suddenly you are incrementing or decrementing
numbers in statements all over the place.
Y has 10 named columns, with full schemas.
Use case 1, operate on subset:
{code}
Z = foreach Y generate myUDF(firstcol, secondcol, thridcol) as result,
forthcol, fifthcol, sixthcol, seventhcol, eigthcol, ninethcol, tenthcol;
{code}
Use case 2, remove a subset:
{code}
Z = foreach Y generate firstcol, forthcol, fifthcol, sixthcol, seventhcol,
eigthcol, ninethcol, tenthcol;
{code}
Why not just make the * operator have a few different forms or use a new
operator?
Use case 1 becomes:
{code}
Z = foreach Y generate myUDF(firstcol, secondcol, thridcol) as result, *+;
{code}
*+ would mean "all columns not referenced"
Use case 2 becomes:
{code}
Z = foreach Y generate *- (secondcol, thridcol);
{code}
*- would generate all columns other than the set listed right after it.
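To pin down what I mean by the two forms, here is a rough model of their semantics in Python rather than Pig. Nothing here exists in Pig: `star_plus`, `star_minus`, and `Y_SCHEMA` are names I made up for the sketch, and the schema is just the ten column names from the examples above.

```python
# Hypothetical model of the proposed '*+' and '*-' forms; this is not Pig code.
Y_SCHEMA = ["firstcol", "secondcol", "thridcol", "forthcol", "fifthcol",
            "sixthcol", "seventhcol", "eigthcol", "ninethcol", "tenthcol"]


def star_plus(schema, referenced):
    """'*+': every column in schema order that the statement has not referenced."""
    used = set(referenced)
    return [col for col in schema if col not in used]


def star_minus(schema, excluded):
    """'*-': every column in schema order except the explicitly listed ones."""
    dropped = set(excluded)
    unknown = dropped - set(schema)
    if unknown:
        # Assumption: excluding a name that is not in the schema is an error.
        raise KeyError("unknown columns: " + ", ".join(sorted(unknown)))
    return [col for col in schema if col not in dropped]


# Use case 1: myUDF(firstcol, secondcol, thridcol) as result, *+
z1 = ["result"] + star_plus(Y_SCHEMA, ["firstcol", "secondcol", "thridcol"])

# Use case 2: *- (secondcol, thridcol)
z2 = star_minus(Y_SCHEMA, ["secondcol", "thridcol"])
```

Both helpers keep the original column order and never mention a position, which is the property I care about.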
I'm not saying these are the best operators or syntax, but syntax that did not
involve number ranges and simply 'works' for 'generate all that have not been
referenced' and 'generate all excluding (set of aliases)' would be awesome. I
definitely don't want to be counting aliases to discover that fieldFoo is the
23rd alias and fieldBar is the 29th.
There are a lot of problems with ranges combined with names. And you still
have to keep track of the count of columns, which isn't fun when there are 40.
A "shared" alias uses names so that scripts that consume it never have to change
if the alias adds columns; if it removes columns, only scripts that used those
columns have to change.
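A small illustration of that last point, again modelled in Python rather than Pig. The shared alias and its field names (`user`, `region`, `clicks`, `cost`) are invented for the example, and `by_position` and `by_name` are hypothetical helpers standing in for positional and named references.

```python
# Hypothetical shared alias, before and after a column is added in the middle.
schema_v1 = ["user", "clicks", "cost"]
row_v1 = ("alice", 7, 1.25)

schema_v2 = ["user", "region", "clicks", "cost"]
row_v2 = ("alice", "us-west", 7, 1.25)


def by_position(row, index):
    """Positional reference: depends on where the column happens to sit."""
    return row[index]


def by_name(schema, row, name):
    """Named reference: looks the column up in the schema first."""
    return row[schema.index(name)]


# A consumer that wants "clicks" and was written against schema_v1.
pos_v1 = by_position(row_v1, 1)
pos_v2 = by_position(row_v2, 1)   # silently reads "region" now
name_v1 = by_name(schema_v1, row_v1, "clicks")
name_v2 = by_name(schema_v2, row_v2, "clicks")
```

The positional consumer picks up a different field after the change without any error, while the named consumer keeps returning the same value.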
> There needs to be a way in foreach to indicate "and all the rest of the
> fields"
> -------------------------------------------------------------------------------
>
> Key: PIG-1693
> URL: https://issues.apache.org/jira/browse/PIG-1693
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Reporter: Alan Gates
> Assignee: Daniel Dai
> Fix For: 0.9.0
>
>
> A common use case we see in Pig is people have many columns in their data and
> they only want to operate on a few of them. Consider for example if before
> storing data with ten columns, the user wants to perform a cast on one column:
> {code}
> ...
> Z = foreach Y generate (int)firstcol, secondcol, thridcol, forthcol,
> fifthcol, sixthcol, seventhcol, eigthcol, ninethcol, tenthcol;
> store Z into 'output';
> {code}
> Obviously this only gets worse as the user has more columns. Ideally the
> above could be transformed to something like:
> {code}
> ...
> Z = foreach Y generate (int)firstcol, "and all the rest";
> store Z into 'output';
> {code}