I think the latter option, dropping the undeclared fields, more closely matches user intent. Since the user gave a schema in the LOAD, it seems fair to assume that he is interested only in the declared fields and hence expects to see only those fields in the output, IMHO.
-Pradeep

-----Original Message-----
From: Alan Gates [mailto:[EMAIL PROTECTED]
Sent: Monday, June 02, 2008 9:59 AM
To: [email protected]
Subject: Does LOAD ... AS constitute a projection?

This mail applies only to changes in the types branch.

The central question is whether the declaration of fields in LOAD ... AS constitutes a projection or not.

With the changes in the type branch, we are allowing users to declare types for fields in the load like this:

    A = LOAD 'myfile' AS (a: int, b:float);

We would like to implement this as:

    A = LOAD 'myfile';
    A' = FOREACH A generate (int)$0, (float)$1;

and then let the optimizer push that conversion as far down as possible, or completely remove it in cases where a declared field is never used.

But consider a pig latin script such as:

    A = LOAD 'myfile' AS (a: int, b: float);
    B = FILTER A BY a > 0;
    C = SORT B by a;
    STORE C;

What if a given tuple has 3 fields instead of 2? Is that field anonymously carried along and stored as part of C? Or does the AS in LOAD constitute a projection, so that it's legal to lop off any fields past the second (b)?

In favor of carrying it along is the argument that we shouldn't force the user to declare all data in a file, maybe he only wants to declare a few fields he needs to work with but he still wants to store all the rest.

In favor of lopping it is that the user told us about his data, we're justified in assuming that he described it completely. It is also easier to implement this way, as it allows us to make a set of optimization assumptions.

Thoughts?

Alan.
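
To make the two readings concrete, here is a minimal Python sketch of what each would do to a 3-field tuple loaded under a 2-field schema. It is an illustration only, not Pig code: the function names (`cast_declared`, `load_carry_along`, `load_as_projection`) and the list-of-pairs schema representation are hypothetical, and tuples shorter than the schema are not handled specially.

```python
# Hypothetical model of the two candidate semantics for LOAD ... AS.
# A schema is a list of (field_name, cast_function) pairs.

def cast_declared(tup, schema):
    """Cast the leading fields of tup to their declared types."""
    # zip stops at the shorter input, so only declared fields are cast.
    return [cast(value) for (_, cast), value in zip(schema, tup)]

def load_carry_along(tup, schema):
    """Reading 1: undeclared trailing fields are carried along, uncast."""
    return cast_declared(tup, schema) + list(tup[len(schema):])

def load_as_projection(tup, schema):
    """Reading 2: AS is a projection; undeclared trailing fields are dropped."""
    return cast_declared(tup, schema)

schema = [("a", int), ("b", float)]   # AS (a: int, b: float)
raw = ("7", "1.5", "extra")           # a tuple with 3 fields instead of 2

print(load_carry_along(raw, schema))
print(load_as_projection(raw, schema))
```

Under the first reading the third field would reach the STORE untouched; under the second it is gone after the load, which is what lets an optimizer assume the tuple width equals the schema width.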
