This mail applies only to changes in the types branch.

The central question is whether the declaration of fields in LOAD ... AS constitutes a projection or not.

With the changes in the types branch, we are allowing users to declare types for fields in the LOAD statement like this:

A = LOAD 'myfile' AS (a: int, b:float);

We would like to implement this as:

A = LOAD 'myfile';
A' = FOREACH A generate (int)$0, (float)$1;

and then let the optimizer push that conversion as far down as possible, or completely remove it in cases where a declared field is never used.

But consider a Pig Latin script such as:

A = LOAD 'myfile' AS (a: int, b: float);
B = FILTER A BY a > 0;
C = ORDER B BY a;
STORE C;

What if a given tuple has 3 fields instead of 2? Is the third field anonymously carried along and stored as part of C? Or does the AS in LOAD constitute a projection, so that it's legal to lop off any fields past the second (b)?

In favor of carrying it along is the argument that we shouldn't force the user to declare all data in a file. Maybe he only wants to declare the few fields he needs to work with, but he still wants to store all the rest.

In favor of lopping it off is the argument that the user told us about his data, so we're justified in assuming that he described it completely. It is also easier to implement this way, as it allows us to make a set of optimization assumptions.
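
To make the difference concrete, here is a rough sketch of the two readings in Python rather than Pig. It is only a model for discussion, not how either would be implemented: tuples are lists of raw strings, and the names SCHEMA, load_carry_along, load_as_projection and run_script are made up for this example.

```python
# Hypothetical model: a tuple is a list of raw strings, and SCHEMA holds the
# (name, cast) pairs declared in LOAD ... AS (a: int, b: float).
SCHEMA = [("a", int), ("b", float)]


def load_carry_along(raw, schema=SCHEMA):
    """Cast the declared fields; pass any extra fields through untouched."""
    declared = [cast(value) for (_, cast), value in zip(schema, raw)]
    return declared + list(raw[len(schema):])


def load_as_projection(raw, schema=SCHEMA):
    """Cast the declared fields; drop everything past the last declared one."""
    return [cast(value) for (_, cast), value in zip(schema, raw)]


def run_script(rows, load):
    """Mimic the script above: load, keep tuples with a > 0, order by a."""
    loaded = [load(row) for row in rows]
    kept = [row for row in loaded if row[0] > 0]
    return sorted(kept, key=lambda row: row[0])


# Every input tuple has a third, undeclared field.
rows = [["3", "1.5", "extra"], ["-1", "2.0", "gone"], ["1", "0.5", "more"]]

print(run_script(rows, load_carry_along))    # third field survives into C
print(run_script(rows, load_as_projection))  # third field is dropped at load
```

Under the first reading the stored tuples still have three fields; under the second they have exactly the two that were declared.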

Thoughts?

Alan.
