[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882711#action_12882711 ]
Aniket Mokashi commented on PIG-1434:
-------------------------------------

The proposal for scalars is as follows -
{code}
A = load '1.txt' as (a1, a2);
B = group A all;
C = foreach B generate COUNT(A);
Y = foreach A generate C;
store Y into 'Ystore';
{code}
Based on the schema of C, we detect that Y means to use C as a scalar and internally track it as a scalar. Thus, operations like C * C are also allowed. The limitation is that C should have a value convertible to long (when stored into the file). Also, (int) C would be allowed and will succeed if the cast operation succeeds.

As mentioned by Daniel earlier, there are two challenges in introducing scalars--

1. Addition of implicit store- We cannot do it too early (parsing), as we would get a redundant (implicit) store operation for the rest of the commands in the script. If we do it too late, the merge algorithm doesn't find the store and discards the branch that compiles and executes the store.
To solve this, whenever we process a store plan after the parsing stage, we detect the existence of scalars in the plan and add the required branches that have those scalars into the current plan. We also attach LOStores for the scalars and merge the required plan.

2. Tracking of implicit dependency- The existence of scalar C needs to be converted into an implicit ReadScalar operation, but in addition it also needs to add a dependency on the map-reduce job that generates this scalar value. We track this dependency by adding LOScalar and POScalar operators that carry a reference to the scalar they depend upon. When we compile the map-reduce plan, we replace POScalar with POUserFunc to load the scalar value and mark the dependency between the two map-reduce jobs.

I am attaching the patch with the above-mentioned changes.

A few known issues-
To track the dependencies of scalars, we need access to the map of operators from one type of plan to the other, but this map is generated by visitors.
The same visitors are responsible for converting LOScalar -> POScalar -> POUserFunc. So, if a visitor visits LOScalar before the LO associated with the scalar (C in the example), we do not find the PO associated with C.

> Allow casting relations to scalars
> ----------------------------------
>
> Key: PIG-1434
> URL: https://issues.apache.org/jira/browse/PIG-1434
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Aniket Mokashi
> Fix For: 0.8.0
>
> Attachments: scalarImpl.patch
>
>
> This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .....
> X = ....
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be reported
> (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF
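
The visit-order issue noted in the comment (a visitor reaching LOScalar before the logical operator for C has been translated) can be illustrated with a small Java sketch. Everything in it is hypothetical: `LogicalOp`, `translate`, and the string stand-ins for physical operators are not Pig classes. The code only mimics a visitor that fills a logical-to-physical map as it walks the plan, and it assumes the lookup happens at visit time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScalarVisitOrderSketch {

    // Hypothetical stand-in for a logical operator. refersTo is non-null
    // only for an operator that references a scalar (like LOScalar).
    static final class LogicalOp {
        final String name;
        final LogicalOp refersTo;

        LogicalOp(String name, LogicalOp refersTo) {
            this.name = name;
            this.refersTo = refersTo;
        }
    }

    // Translates operators in the given visit order, filling the
    // logical-to-physical map as it goes. Returns the names of scalar
    // references whose target had no physical counterpart when visited.
    static List<String> translate(List<LogicalOp> visitOrder) {
        Map<LogicalOp, String> logToPhy = new HashMap<>();
        List<String> unresolved = new ArrayList<>();
        for (LogicalOp op : visitOrder) {
            if (op.refersTo != null && !logToPhy.containsKey(op.refersTo)) {
                unresolved.add(op.name);
            }
            logToPhy.put(op, "PO(" + op.name + ")");
        }
        return unresolved;
    }

    public static void main(String[] args) {
        LogicalOp c = new LogicalOp("C", null);
        LogicalOp scalar = new LogicalOp("LOScalar(C)", c);

        // C visited first: the scalar reference finds its physical operator.
        System.out.println(translate(Arrays.asList(c, scalar))); // prints []

        // LOScalar visited first: the map has no entry for C yet.
        System.out.println(translate(Arrays.asList(scalar, c))); // prints [LOScalar(C)]
    }
}
```

The second call reproduces the failure mode described above: the map entry for C does not exist yet when the scalar reference is visited, so the lookup comes back empty.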