[ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882711#action_12882711
 ] 

Aniket Mokashi commented on PIG-1434:
-------------------------------------

The proposal for scalars is as follows:
{code}
A = load '1.txt' as (a1, a2);
B = group A all;
C = foreach B generate COUNT(A);
Y = foreach A generate C;
store Y into 'Ystore';
{code}
Based on the schema of C, we detect that Y intends to use C as a scalar and 
track it internally as a scalar. Operations such as C * C are therefore also 
allowed. The limitation is that C must hold a value that is convertible to long 
(once it has been stored into the file). A cast such as (int) C is also allowed 
and succeeds if the cast operation succeeds.
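
To make the constraint concrete, here is a rough sketch in Python (not Pig code) of what reading the scalar back amounts to. The function name and the list-of-tuples representation are made up for illustration, and the exactly-one-value check is taken from the issue description rather than from the patch.

```python
def read_scalar(rows):
    """Return the single value held in `rows` as an integer.

    `rows` is a hypothetical stand-in for the tuples read back from the
    file that the scalar relation was stored into.
    """
    if len(rows) != 1 or len(rows[0]) != 1:
        raise ValueError("scalar relation must hold exactly one value")
    # Python's int plays the role of Pig's long here; a value that is
    # not convertible (e.g. "abc") raises ValueError as well.
    return int(rows[0][0])


c = read_scalar([("3",)])   # e.g. the stored output of COUNT(A)
print(c * c)                # arithmetic on the scalar, as in C * C
```

A relation with more than one tuple, or a value that cannot be converted, is rejected instead of being silently truncated.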

As mentioned by Daniel earlier, there are two challenges in introducing 
scalars:
1. Addition of the implicit store. We cannot add it too early (during parsing), 
because we would then get a redundant (implicit) store operation for the rest 
of the commands in the script. If we add it too late, the merge algorithm does 
not find the store and discards the branch that compiles and executes the store.
To solve this, whenever we process a store plan after the parsing stage, we 
detect scalars in the plan and add the required branches that contain those 
scalars to the current plan. We also attach LOStores for the scalars and merge 
the required plan.
2. Tracking of the implicit dependency. A reference to scalar C needs to be 
converted into an implicit ReadScalar operation, and in addition it must add a 
dependency on the map-reduce job that generates the scalar value. We track 
this dependency by adding LOScalar and POScalar operators that carry a 
reference to the scalar they depend upon. When we compile the map-reduce plan, 
we replace POScalar with a POUserFunc that loads the scalar value and mark the 
dependency between the two map-reduce jobs.
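
The two points above can be modelled roughly as follows. This is a Python sketch under simplifying assumptions, not the code in the patch: aliases stand in for plan branches, `plan_stores` and its arguments are hypothetical names, and the job ordering uses the standard-library `graphlib` module instead of Pig's map-reduce compiler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_stores(explicit_stores, scalar_refs):
    """Return (job order, implicit stores) for a script.

    explicit_stores: aliases the script stores explicitly, e.g. ["Y"]
    scalar_refs:     alias -> scalar aliases it reads, e.g. {"Y": ["C"]}
    Both structures are hypothetical stand-ins for Pig's plans.
    """
    depends_on = {}      # alias -> set of scalar aliases it must wait for
    implicit = []        # scalars that need an implicit store
    pending = list(explicit_stores)
    while pending:
        alias = pending.pop()
        if alias in depends_on:
            continue     # this branch is already part of the plan
        depends_on[alias] = set(scalar_refs.get(alias, ()))
        for scalar in depends_on[alias]:
            if scalar not in explicit_stores and scalar not in implicit:
                implicit.append(scalar)   # attach a store for the scalar
            pending.append(scalar)        # pull the scalar's branch in
    # Producers of a scalar are ordered before the jobs that read it.
    order = list(TopologicalSorter(depends_on).static_order())
    return order, implicit


order, implicit = plan_stores(["Y"], {"Y": ["C"]})
print(order, implicit)
```

For the example script, the branch for C is pulled in with an implicit store and is scheduled before the job that stores Y.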

I am attaching the patch with the above-mentioned changes.

A few known issues:
To track the dependencies of scalars, we need access to the map of operators 
from one type of plan to the other, but this map is generated by visitors. The 
same visitors are responsible for converting LOScalar -> POScalar -> POUserFunc. 
So, if a visitor visits LOScalar before the LO associated with the scalar (C in 
the example), we do not find the PO associated with C.
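
One possible way around this ordering problem, sketched in Python, is to defer the lookup until the visitor has finished. This is only an illustration of the idea and not what the attached patch does; the operator names and the `translate` function are hypothetical.

```python
def translate(visit_order, scalar_targets):
    """Translate logical operators to physical ones in `visit_order`.

    scalar_targets: scalar operator -> logical operator it refers to.
    Returns (logical -> physical map, scalar -> physical target).
    """
    lo_to_po = {}    # filled in as the visitor walks the plan
    resolved = {}    # scalar operator -> physical operator of its target
    pending = []     # scalars visited before their target was translated
    for op in visit_order:
        lo_to_po[op] = "PO:" + op
        if op in scalar_targets:
            target = scalar_targets[op]
            if target in lo_to_po:
                resolved[op] = lo_to_po[target]
            else:
                pending.append(op)
    # Second pass: by now every visited operator has a physical counterpart.
    for op in pending:
        resolved[op] = lo_to_po[scalar_targets[op]]
    return lo_to_po, resolved


# The scalar is visited before C, yet it still resolves to C's operator.
_, resolved = translate(["scalar_in_Y", "C"], {"scalar_in_Y": "C"})
print(resolved)
```

With the deferred pass, the result no longer depends on whether the visitor reaches the scalar or C first.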

> Allow casting relations to scalars
> ----------------------------------
>
>                 Key: PIG-1434
>                 URL: https://issues.apache.org/jira/browse/PIG-1434
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Aniket Mokashi
>             Fix For: 0.8.0
>
>         Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described 
> in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be 
> reported
> (2) Name resolution is needed since relation X might have field named C in 
> which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. 
> I believe we already have a UDF that Ben Reed contributed for this purpose. 
> Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
