[
https://issues.apache.org/jira/browse/PIG-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923953#action_12923953
]
Alan Gates commented on PIG-1693:
---------------------------------
I can see a couple of ways of approaching this.
One would be something like the colon in Python's slice syntax, meaning
everything in between. As the colon is not widely used for this across
programming languages, I propose '...' instead, since that is the natural
language meaning of an ellipsis.
If it were used before a certain field, it would mean everything from the
beginning up to that field:
{code}
B = foreach A generate ..., $10;
{code}
would mean $0-$9
If used between two fields, it would mean everything in between:
{code}
B = foreach A generate $7, ..., $10;
{code}
would mean $8 and $9.
If used at the end of the line, it would mean everything after the last
referenced field:
{code}
B = foreach A generate $10, ...;
{code}
would mean $11 to the end of the record.
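The three cases above can be modeled outside of Pig. What follows is a minimal
Python sketch, not Pig code and not a proposed implementation. The names
expand_ellipsis, ELLIPSIS and num_fields are hypothetical, and it assumes the
schema (and so the field count) is known and that two '...' markers never
appear next to each other.

```python
# Hypothetical model of the proposed '...' semantics; not part of Pig.
ELLIPSIS = "..."


def expand_ellipsis(items, num_fields):
    """Expand '...' markers in a list of column positions.

    items: column positions (ints) mixed with ELLIPSIS markers.
    num_fields: total number of fields in the record (assumes a known schema).
    Assumes two ELLIPSIS markers are never adjacent.
    """
    result = []
    for i, item in enumerate(items):
        if item != ELLIPSIS:
            result.append(item)
            continue
        # Start one past the previous field, or at $0 if '...' comes first.
        start = items[i - 1] + 1 if i > 0 else 0
        # Stop before the next field, or at the end of the record.
        stop = items[i + 1] if i + 1 < len(items) else num_fields
        result.extend(range(start, stop))
    return result
```

For example, expand_ellipsis([7, ELLIPSIS, 10], 15) yields the positions 7, 8,
9, 10, matching the "$7, ..., $10" case.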
Another approach would be to define a symbol that means "all fields not
referenced in this list of expressions". If, for
example, we chose @ to mean this, then:
{code}
B = foreach A generate $10, @;
{code}
would mean $0-$9, and $11 to the end.
This raises an ordering question: does $10 keep its place as the eleventh
field, or does it become the first field?
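To make the ordering question concrete, here is a small Python sketch of one
possible reading, in which '@' expands in place to every unreferenced column.
This reading is an assumption made for illustration, and the names expand_rest
and REST are hypothetical.

```python
# Hypothetical model of the '@' proposal; not part of Pig.
REST = "@"


def expand_rest(items, num_fields):
    """Replace REST with all columns not referenced elsewhere in items.

    Assumes '@' expands in place, which is only one possible answer to the
    ordering question.
    """
    referenced = {c for c in items if c != REST}
    rest = [c for c in range(num_fields) if c not in referenced]
    result = []
    for item in items:
        if item == REST:
            result.extend(rest)
        else:
            result.append(item)
    return result
```

Under this reading, expand_rest([10, REST], 13) puts position 10 first,
followed by 0 through 9 and then 11 and 12, so $10 would become the first
field.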
I like the '...' option better, as it allows more control of ordering and will
be easier for users to understand.
Whichever one we choose we have to answer what it means if an expression
contains more than one field:
{code}
B = foreach A generate udf($3, $5), ..., udf($8, $10);
{code}
What range does '...' include? I propose that it be bounded by the highest
column number referenced on the left and the lowest referenced on the right,
exclusive (thus in this example, '...' covers $6 and $7).
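The rule for multi-field expressions can be written as a short Python sketch.
The function name ellipsis_between is hypothetical, and each expression is
modeled simply as the list of column positions it references.

```python
def ellipsis_between(left_columns, right_columns):
    """Columns covered by '...' between two expressions.

    left_columns / right_columns: positions referenced by the expression to
    the left and to the right of '...'. The range is exclusive on both ends.
    """
    start = max(left_columns) + 1  # one past the highest column on the left
    stop = min(right_columns)      # up to, not including, the lowest on the right
    return list(range(start, stop))
```

For "udf($3, $5), ..., udf($8, $10)", ellipsis_between([3, 5], [8, 10]) gives
positions 6 and 7.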
In the @ case it's clear that @ would refer to $0, $1, $2, $4, $6, $7, $9, and
anything past $10. But the ordering becomes even stickier. Where do $4 and $9
go?
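The set of columns '@' stands for in the multi-field case can be checked with a
few lines of Python. This is only a sketch: the name unreferenced_columns is
hypothetical and the field count of 13 used below is assumed. It returns the
columns in their original order and does not settle where they should go.

```python
def unreferenced_columns(expressions, num_fields):
    """Columns not referenced by any expression, in original order.

    expressions: one list of referenced column positions per expression.
    """
    referenced = {c for columns in expressions for c in columns}
    return [c for c in range(num_fields) if c not in referenced]
```

With an assumed 13 fields, unreferenced_columns([[3, 5], [8, 10]], 13) gives
0, 1, 2, 4, 6, 7, 9, 11 and 12.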
In cases where Pig knows the schema, the '...' or '@' operator could be
resolved at compile time, which will be more efficient. In cases where it does
not, a new physical operator would be required to handle the '@' case or the
trailing ellipsis case ("$1, ..."), as we cannot construct a set of
projections that knows exactly which columns to pass through.
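The difference can be illustrated with a Python sketch of what a runtime
operator would have to do for the "$1, ..." case when the schema is unknown.
This is not Pig's physical operator API, and the name
project_with_trailing_ellipsis is hypothetical. The point is that the width
of the output depends on each record and cannot be fixed in advance.

```python
def project_with_trailing_ellipsis(record, leading_columns):
    """Project leading_columns, then everything after the last of them.

    record: a tuple of field values whose length is only known at runtime.
    leading_columns: non-empty list of positions named before the '...'.
    """
    last = leading_columns[-1]
    named = tuple(record[c] for c in leading_columns)
    # The tail can only be resolved once the actual record is in hand.
    return named + tuple(record[last + 1:])
```

Two records of different lengths give outputs of different widths:
project_with_trailing_ellipsis(("a", "b", "c"), [1]) keeps "b" and "c", while
the same call on a five-field record keeps four fields.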
> There needs to be a way in foreach to indicate "and all the rest of the
> fields"
> -------------------------------------------------------------------------------
>
> Key: PIG-1693
> URL: https://issues.apache.org/jira/browse/PIG-1693
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Reporter: Alan Gates
>
> A common use case we see in Pig is people have many columns in their data and
> they only want to operate on a few of them. Consider for example if before
> storing data with ten columns, the user wants to perform a cast on one column:
> {code}
> ...
> Z = foreach Y generate (int)firstcol, secondcol, thirdcol, fourthcol,
> fifthcol, sixthcol, seventhcol, eighthcol, ninthcol, tenthcol;
> store Z into 'output';
> {code}
> Obviously this only gets worse as the user has more columns. Ideally the
> above could be transformed to something like:
> {code}
> ...
> Z = foreach Y generate (int)firstcol, "and all the rest";
> store Z into 'output';
> {code}