[ 
https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923953#action_12923953
 ] 

Alan Gates commented on PIG-1693:
---------------------------------

I can see a couple of ways of approaching this.  

One would be something like the colon (slice) operator in Python, meaning 
everything in between.  As the colon is not widely used for this across 
programming languages, I propose '...' instead, since that is the natural 
language meaning of an ellipsis.  If it were used before a certain field it 
would mean everything from the beginning up to, but not including, that field:

{code}
B = foreach A generate ..., $10;
{code}
Here '...' would mean $0-$9.

If used between two fields, it would mean everything in between:

{code}
B = foreach A generate $7, ..., $10;
{code}
would mean $8 and $9.

If used at the end of the line, it would mean everything after the last 
referenced field:

{code}
B = foreach A generate $10, ...;
{code}

would mean $11 to the end of the record.
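
The three placements above can be sketched in Python, the language the colon comparison comes from.  This is only an illustration of the proposed semantics, not Pig code: expand_ellipsis and the ELLIPSIS marker are hypothetical names, fields are represented by their integer positions, and the sketch assumes each '...' sits next to plain field references.

```python
ELLIPSIS = "..."  # hypothetical marker standing in for the proposed '...' token


def expand_ellipsis(items, num_fields):
    """Expand '...' in a list of field positions for a record with num_fields fields."""
    out = []
    for i, item in enumerate(items):
        if item != ELLIPSIS:
            out.append(item)
            continue
        # Start just after the previous field, or at $0 if '...' comes first.
        start = items[i - 1] + 1 if i > 0 else 0
        # Stop just before the next field, or at the end of the record if '...' comes last.
        stop = items[i + 1] if i + 1 < len(items) else num_fields
        out.extend(range(start, stop))
    return out
```

For example, expand_ellipsis([7, ELLIPSIS, 10], 15) gives the positions 7, 8, 9, 10, matching "$7, ..., $10" above.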

Another approach would be to define a symbol that means "all fields not 
referenced in this list of expressions".  If, for example, we chose @ to mean 
this, then:

{code}
B = foreach A generate $10, @;
{code}
would mean $0-$9, and $11 to the end.

This raises an ordering question: does $10 keep its place as the eleventh 
field, or does it become the first field?
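
To make the two possible answers concrete, here is a small Python sketch of both readings of "$10, @".  The function names are hypothetical and fields are again plain integer positions; neither reading is being proposed as the right one here.

```python
def unreferenced(referenced, num_fields):
    """Positions not named anywhere in the generate list, in record order."""
    named = set(referenced)
    return [i for i in range(num_fields) if i not in named]


def at_in_written_order(referenced, num_fields):
    # Reading 1: the named fields come first, as written, then everything else.
    return list(referenced) + unreferenced(referenced, num_fields)


def at_in_record_order(referenced, num_fields):
    # Reading 2: every field keeps the position it had in the input record.
    return sorted(set(referenced) | set(unreferenced(referenced, num_fields)))
```

With a 13-field record, the first reading puts $10 at the front of the output, while the second leaves it as the eleventh field.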

I like the '...' option better, as it allows more control of ordering and will 
be easier for users to understand.

Whichever one we choose we have to answer what it means if an expression 
contains more than one field:

{code}
B = foreach A generate udf($3, $5), ..., udf($8, $10);
{code}

What range does '...' include?  I propose it covers the columns strictly 
between the highest column number on the left ($5 here) and the lowest on the 
right ($8 here), thus $6 and $7 in this example. 
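
As a sketch of that rule in Python (ellipsis_between is a hypothetical name, and each argument is the list of column positions used by the expression on that side of the '...'):

```python
def ellipsis_between(left_cols, right_cols):
    """Columns strictly between the highest column on the left and the lowest on the right."""
    return list(range(max(left_cols) + 1, min(right_cols)))
```

For udf($3, $5), ..., udf($8, $10) this is ellipsis_between([3, 5], [8, 10]), which yields positions 6 and 7.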

In the @ case it's clear that @ would refer to $0, $1, $2, $4, $6, $7, $9, and 
anything past $10.  But the ordering becomes even stickier.  Where do $4 and $9 
go?

In cases where Pig knows the schema, the '...' or '@' operator could be 
resolved at compile time.  This will be more efficient.  In cases where it does 
not, a new physical operator would be required to handle the @ case or the 
trailing-ellipsis case ("$1, ..."), as we cannot construct a set of projections 
that knows exactly which columns to pass through.
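
The difference between the two cases can be illustrated with a Python sketch.  It does not use Pig's actual operators: make_projector is a hypothetical name, a record is a plain list, and start is the first column passed through (1 for "$1, ...").

```python
def make_projector(start, schema_width=None):
    """Build a function that projects a record for a trailing '...' beginning at start."""
    if schema_width is not None:
        # Schema known: the column list is computed once, before any data is read.
        columns = list(range(start, schema_width))
        return lambda record: [record[c] for c in columns]
    # Schema unknown: the width is only discovered when each record arrives.
    return lambda record: list(record[start:])
```

In the first branch the set of projections is fixed up front; in the second, the number of columns passed through can differ from record to record, which is why a fixed set of projections is not enough.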



> There needs to be a way in foreach to indicate "and all the rest of the 
> fields"
> -------------------------------------------------------------------------------
>
>                 Key: PIG-1693
>                 URL: https://issues.apache.org/jira/browse/PIG-1693
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Alan Gates
>
> A common use case we see in Pig is people have many columns in their data and 
> they only want to operate on a few of them.  Consider for example if before 
> storing data with ten columns, the user wants to perform a cast on one column:
> {code}
> ...
> Z = foreach Y generate (int)firstcol, secondcol, thirdcol, fourthcol, 
> fifthcol, sixthcol, seventhcol, eighthcol, ninthcol, tenthcol;
> store Z into 'output';
> {code}
> Obviously this only gets worse as the user has more columns.  Ideally the 
> above could be transformed to something like:
> {code}
> ...
> Z = foreach Y generate (int)firstcol, "and all the rest";
> store Z into 'output';
> {code}

