[ 
https://issues.apache.org/jira/browse/PIG-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628510#action_12628510
 ] 

Santhosh Srinivasan commented on PIG-335:
-----------------------------------------

The proposed design for computing the lineage for load functions:

The FieldSchema class will gain an additional member variable that holds a 
list of parent (ancestral) canonical names. These are the canonical names 
that the operator needs in order to compute the field schema.

The parent list will be empty for canonical names that originate from the load 
statement; these names remain unchanged as they move from operator to operator. 
Only expressions (arithmetic expressions, etc.) will create new canonical names.
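
As a rough illustration of the bookkeeping described above, here is a minimal 
sketch. The class LineageFieldSchema, its methods, and the "c1", "c2" naming 
scheme are hypothetical stand-ins made up for this example; this is not Pig's 
actual FieldSchema class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-in for a field schema that tracks lineage.
// This is NOT Pig's actual FieldSchema class.
class LineageFieldSchema {
    private static int counter = 0;

    final String alias;
    final String canonicalName;
    // Canonical names this field was computed from; empty for loaded fields.
    final List<String> parentCanonicalNames;

    private LineageFieldSchema(String alias, String canonicalName,
                               List<String> parents) {
        this.alias = alias;
        this.canonicalName = canonicalName;
        this.parentCanonicalNames =
            Collections.unmodifiableList(new ArrayList<>(parents));
    }

    // Assumed naming scheme for illustration only: "c1", "c2", ...
    private static String newCanonicalName() {
        return "c" + (++counter);
    }

    // A field produced by a load statement: new name, no parents.
    static LineageFieldSchema fromLoad(String alias) {
        return new LineageFieldSchema(alias, newCanonicalName(),
                                      Collections.<String>emptyList());
    }

    // A field that flows through an operator unchanged keeps its name.
    LineageFieldSchema passThrough() {
        return new LineageFieldSchema(alias, canonicalName,
                                      parentCanonicalNames);
    }

    // An expression creates a new canonical name and records its inputs.
    static LineageFieldSchema fromExpression(String alias,
                                             LineageFieldSchema... inputs) {
        List<String> parents = new ArrayList<>();
        for (LineageFieldSchema in : inputs) {
            parents.add(in.canonicalName);
        }
        return new LineageFieldSchema(alias, newCanonicalName(), parents);
    }
}
```

For the script in the issue description, t would be created with fromLoad, 
carried through the cogroup and foreach with passThrough, and t + 1 would be 
created with fromExpression, with t's canonical name as its only parent.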

The load operator corresponding to the parent canonical name is required only 
to cast byte arrays into Pig types. Other than UDFs, there are no operators 
that generate byte arrays. CONCAT can generate byte arrays, but for now it is 
also a UDF.

To compute the load function associated with a field schema, each canonical 
name in the parent list of canonical names is matched against the operator 
responsible for that canonical name. If the operator is a UDF, then we throw an 
exception, as we will not know how to convert a byte array generated by the UDF 
into a Pig type. The check bubbles up the graph until we hit the load operator 
corresponding to the canonical name in question.
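
The lookup described above could be sketched roughly as follows. Everything 
here (LineageResolver, Kind, Producer, register, resolveLoadFunc) is a 
hypothetical name invented for the example, not part of Pig. The sketch also 
assumes that a field whose ancestors lead to zero or to several different 
load functions is an error; the design above does not say how that case 
should be handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the lineage check; none of these names exist in Pig.
class LineageResolver {
    enum Kind { LOAD, EXPRESSION, UDF }

    // The operator responsible for a canonical name, plus that name's parents.
    static final class Producer {
        final Kind kind;
        final String loadFunc;      // set only when kind == LOAD
        final List<String> parents; // parent canonical names

        Producer(Kind kind, String loadFunc, List<String> parents) {
            this.kind = kind;
            this.loadFunc = loadFunc;
            this.parents = new ArrayList<>(parents);
        }
    }

    private final Map<String, Producer> producers = new HashMap<>();

    void register(String canonicalName, Kind kind, String loadFunc,
                  List<String> parents) {
        producers.put(canonicalName, new Producer(kind, loadFunc, parents));
    }

    // Returns the load function whose cast logic applies to this field.
    String resolveLoadFunc(String canonicalName) {
        Set<String> found = new LinkedHashSet<>();
        collect(canonicalName, found);
        if (found.size() != 1) {
            // Assumption: no loader or conflicting loaders is an error.
            throw new IllegalStateException(
                "Cannot pick a single load function for " + canonicalName
                + ": " + found);
        }
        return found.iterator().next();
    }

    // Walks up the graph until load operators are reached.
    private void collect(String name, Set<String> found) {
        Producer p = producers.get(name);
        if (p == null) {
            throw new IllegalArgumentException(
                "Unknown canonical name: " + name);
        }
        switch (p.kind) {
            case LOAD:
                found.add(p.loadFunc);
                break;
            case UDF:
                // A byte array produced by a UDF cannot be cast by a loader.
                throw new IllegalStateException(
                    "Field " + name + " is produced by a UDF");
            case EXPRESSION:
                for (String parent : p.parents) {
                    collect(parent, found);
                }
                break;
        }
    }
}
```

With the script from the issue description, registering t as a LOAD field of 
Loader2 and asking for t's load function would return Loader2, which is the 
loader the cast in t + 1 needs.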

Breakdown of the changes:

1. The logic mentioned in the previous paragraph will reside in the type 
checker. 
2. The changes to the FieldSchema will (of course) be limited to the 
FieldSchema class. 
3. The computation of the list of the parent canonical names will happen in 
each logical operator.

Thoughts/comments on the proposed design are welcome.

> Casting does not work in certain cases with multiple loads
> ----------------------------------------------------------
>
>                 Key: PIG-335
>                 URL: https://issues.apache.org/jira/browse/PIG-335
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Santhosh Srinivasan
>            Priority: Critical
>             Fix For: types_branch
>
>
> Given a script like:
> A = load 'bla' as (x, y) using Loader1();
> B = load 'morebla' as (s, t) using Loader2();
> C = cogroup A by x, B by s;
> D = foreach C generate flatten(A), flatten(B);
> E = foreach D generate x, y, t + 1;
> In this case, in the last foreach, a cast will need to be added to t + 1 to 
> allow t (a byte array) to be added to an integer.  We use load functions to 
> handle this late casting.  The issue is that we do not currently have a way 
> to know whether to use Loader1 or Loader2 to cast the data.  We need to track 
> the lineage of fields so that the cast operator can select the correct loader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
