I think the latter option of dropping fields more closely matches user
intent. Since the user gave a schema in the load, it seems fair to
assume that he is interested only in the declared fields and hence
expects to see only those fields in the output, IMHO.

-Pradeep

-----Original Message-----
From: Alan Gates [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 02, 2008 9:59 AM
To: [email protected]
Subject: Does LOAD ... AS constitute a projection?

This mail applies only to changes in the types branch.

The central question is whether the declaration of fields in LOAD ... AS
constitutes a projection or not.

With the changes in the types branch, we are allowing users to declare 
types for fields in the load like this:

A = LOAD 'myfile' AS (a: int, b:float);

We would like to implement this as:

A = LOAD 'myfile';
A' = FOREACH A generate (int)$0, (float)$1;

and then let the optimizer push that conversion as far down as possible,
or completely remove it in cases where a declared field is never used.

But consider a pig latin script such as:

A = LOAD 'myfile' AS (a: int, b: float);
B = FILTER A BY a > 0;
C = ORDER B BY a;
STORE C;

What if a given tuple has 3 fields instead of 2?  Is that field 
anonymously carried along and stored as part of C?  Or does the AS in 
LOAD constitute a projection, so that it's legal to lop off any fields 
past the second (b)?
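
To make the two readings concrete, here is a small sketch in plain
Python (not Pig, and not how Pig implements anything). It simulates the
script above on made-up rows that each have a third, undeclared field.
The function names and the sample data are hypothetical, chosen only for
illustration.

```python
# Hypothetical simulation of the two readings of
#   LOAD 'myfile' AS (a: int, b: float)
# Not Pig code; names and sample rows are assumptions for illustration.

SCHEMA = [int, float]  # declared types for a and b


def load_carry_along(raw):
    """Cast the declared fields; keep undeclared trailing fields untouched."""
    declared = [cast(value) for cast, value in zip(SCHEMA, raw)]
    return tuple(declared) + tuple(raw[len(SCHEMA):])


def load_as_projection(raw):
    """Cast the declared fields; drop everything past the last declared one."""
    return tuple(cast(value) for cast, value in zip(SCHEMA, raw))


def run_script(rows, load):
    """Mimic the script: load, keep rows with a > 0, then sort by a."""
    loaded = [load(row) for row in rows]
    kept = [t for t in loaded if t[0] > 0]
    return sorted(kept, key=lambda t: t[0])


rows = [("3", "1.5", "extra"), ("-1", "2.0", "gone"), ("1", "0.5", "more")]

carried = run_script(rows, load_carry_along)
projected = run_script(rows, load_as_projection)

print(carried)    # the third field survives all the way to the store
print(projected)  # only (a, b) reach the store
```

Under the first reading the stored tuples still have three fields; under
the second they have exactly the two declared ones.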

In favor of carrying it along is the argument that we shouldn't force
the user to declare all data in a file. Maybe he only wants to declare a
few fields he needs to work with, but he still wants to store all the
rest.

In favor of lopping it off is the argument that the user told us about
his data, so we're justified in assuming that he described it
completely. It is also easier to implement this way, as it allows us to
make a set of optimization assumptions.

Thoughts?

Alan.
