I generally agree with Pradeep that it is cleaner for the user to declare all fields rather than use half names and half positions. However, I can also see the case where the data has a very wide schema (say 25 columns) and the script uses the first 4 fields and then field 25. Forcing the user to declare all 25 fields seems excessive. I wonder if we should allow the schema to optionally include column positions, i.e. a sparse schema.
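To make the sparse-schema idea a bit more concrete, here is a rough sketch in Python rather than Pig Latin, since no such syntax exists in Pig today. Everything in it is hypothetical: the `SparseSchema` mapping, the `apply_sparse_schema` helper, the 0-based positions, and the choice to return `None` for a declared field that is missing from a short row. It follows the "projection" reading, so undeclared columns are dropped.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical sparse schema: 0-based column position -> (field name, converter).
SparseSchema = Dict[int, Tuple[str, Callable[[str], Any]]]


def apply_sparse_schema(row: List[str], schema: SparseSchema) -> Dict[str, Any]:
    """Name and convert only the declared positions; drop everything else."""
    out: Dict[str, Any] = {}
    for pos, (name, convert) in sorted(schema.items()):
        # Assumption: a declared field missing from a short row becomes None.
        out[name] = convert(row[pos]) if pos < len(row) else None
    return out


# Declare the first 4 columns and column 25 of a 25-column row.
schema: SparseSchema = {
    0: ("a", int),
    1: ("b", float),
    2: ("c", str),
    3: ("d", str),
    24: ("y", int),
}

row = [str(i) for i in range(25)]  # "0", "1", ..., "24"
record = apply_sparse_schema(row, schema)
```

The user names 5 fields instead of 25, and the 20 undeclared columns never have to be mentioned. Carrying the undeclared columns along instead would be a different policy and is not shown here.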
Olga

> -----Original Message-----
> From: Pradeep Kamath [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 02, 2008 10:19 AM
> To: [email protected]
> Subject: RE: Does LOAD ... AS constitute a projection?
>
> I think the latter option of dropping fields more closely
> matches user intent. Since the user gave a schema in the
> load, it seems fair to assume that he is interested only in
> the fields declared and hence expects to see only those
> fields in output IMHO.
>
> -Pradeep
>
> -----Original Message-----
> From: Alan Gates [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 02, 2008 9:59 AM
> To: [email protected]
> Subject: Does LOAD ... AS constitute a projection?
>
> This mail applies only to changes in the types branch.
>
> The central question is whether the declaration of fields in
> LOAD ... AS constitutes a projection or not.
>
> With the changes in the type branch, we are allowing users to
> declare types for fields in the load like this:
>
> A = LOAD 'myfile' AS (a: int, b:float);
>
> We would like to implement this as:
>
> A = LOAD 'myfile';
> A' = FOREACH A generate (int)$0, (float)$1;
>
> and then let the optimizer push that conversion as far down
> as possible, or completely remove it in cases where a declared
> field is never used.
>
> But consider a pig latin script such as:
>
> A = LOAD 'myfile' AS (a: int, b: float);
> B = FILTER A BY a > 0;
> C = SORT B by a;
> STORE C;
>
> What if a given tuple has 3 fields instead of 2? Is that
> field anonymously carried along and stored as part of C? Or
> does the AS in LOAD constitute a projection, so that it's
> legal to lop off any fields past the second (b)?
>
> In favor of carrying it along is the argument that we
> shouldn't force the user to declare all data in a file, maybe
> he only wants to declare a few fields he needs to work with
> but he still wants to store all the rest.
>
> In favor of lopping it is that the user told us about his
> data, we're justified in assuming that he described it
> completely. It is also easier to implement this way, as it
> allows us to make a set of optimization assumptions.
>
> Thoughts?
>
> Alan.
