This mail applies only to changes in the types branch.
The central question is whether the declaration of fields in LOAD ... AS
constitutes a projection or not.
With the changes in the types branch, we are allowing users to declare
types for fields in the load like this:
A = LOAD 'myfile' AS (a: int, b: float);
We would like to implement this as:
A = LOAD 'myfile';
A' = FOREACH A generate (int)$0, (float)$1;
and then let the optimizer push that conversion as far down as possible,
or completely remove it in cases where a declared field is never used.
But consider a pig latin script such as:
A = LOAD 'myfile' AS (a: int, b: float);
B = FILTER A BY a > 0;
C = ORDER B BY a;
STORE C;
What if a given tuple has 3 fields instead of 2? Is the third field
anonymously carried along and stored as part of C? Or does the AS in
LOAD constitute a projection, so that it's legal to lop off any fields
past the second (b)?
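
To make the two readings concrete, here is a rough sketch in Python rather
than Pig Latin. It is only a model of the question, not how the types branch
implements anything: load_carry_along, load_as_projection, run_script and the
sample rows are all made-up names and data for illustration.

```python
from typing import Callable, Sequence

# Hypothetical model of the schema declared by LOAD ... AS (a: int, b: float):
# one cast function per declared field.
SCHEMA: Sequence[Callable[[str], object]] = (int, float)


def load_carry_along(raw, schema=SCHEMA):
    """Cast the declared fields; pass undeclared trailing fields through unchanged."""
    declared = [cast(value) for cast, value in zip(schema, raw)]
    return tuple(declared) + tuple(raw[len(schema):])


def load_as_projection(raw, schema=SCHEMA):
    """Cast the declared fields; drop anything past the last declared field."""
    return tuple(cast(value) for cast, value in zip(schema, raw))


def run_script(rows, load):
    """Model of the script: LOAD, FILTER BY a > 0, sort on a, then 'store' (return)."""
    loaded = [load(row) for row in rows]
    filtered = [t for t in loaded if t[0] > 0]
    return sorted(filtered, key=lambda t: t[0])


# Made-up input: every tuple has 3 fields, but only 2 are declared.
rows = [("2", "1.5", "extra"), ("-1", "0.5", "dropped"), ("1", "2.5", "more")]

print(run_script(rows, load_carry_along))    # third field survives to the output
print(run_script(rows, load_as_projection))  # third field is lopped off at load
```

Under the first reading the stored tuples still have three fields; under the
second they have exactly the two that were declared.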
In favor of carrying it along is the argument that we shouldn't force
the user to declare all data in a file. Maybe he only wants to declare a
few fields he needs to work with, but he still wants to store all the rest.
In favor of lopping it off is that, since the user told us about his data,
we're justified in assuming that he described it completely. It is also
easier to implement this way, as it allows us to make a set of
optimization assumptions.
Thoughts?
Alan.