I think if users don't specify a column, it means they don't want it (based on most of the tools I have used). This is already the least surprising behavior.
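
To make the two behaviors under discussion concrete, here is a minimal Python sketch. It is not Pig code and not how the types branch implements anything: load_as and its carry_extra flag are hypothetical names invented for illustration, and filling a missing declared field with None is my own assumption.

```python
from typing import Callable, List, Sequence, Tuple

# A declared schema: (field name, cast function) pairs, matched by position.
Schema = Sequence[Tuple[str, Callable[[str], object]]]


def load_as(rows: Sequence[Sequence[str]], schema: Schema,
            carry_extra: bool = False) -> List[tuple]:
    """Hypothetical model of LOAD ... AS (not Pig's actual implementation).

    carry_extra=False: the schema acts as a projection and extra fields are dropped.
    carry_extra=True:  undeclared trailing fields are carried along untyped.
    """
    out = []
    for row in rows:
        # Cast declared fields by position, like (int)$0, (float)$1.
        # Assumption: a declared field missing from the row becomes None.
        typed = [cast(row[i]) if i < len(row) else None
                 for i, (_name, cast) in enumerate(schema)]
        if carry_extra:
            # Keep whatever lies past the declared fields, unconverted.
            typed.extend(row[len(schema):])
        out.append(tuple(typed))
    return out


rows = [["1", "2.5"], ["3", "4.0", "extra"]]
schema = [("a", int), ("b", float)]

projected = load_as(rows, schema)                  # AS is a projection
carried = load_as(rows, schema, carry_extra=True)  # extra field carried along

print(projected)  # [(1, 2.5), (3, 4.0)]
print(carried)    # [(1, 2.5), (3, 4.0, 'extra')]
```

With the projection behavior every output tuple has exactly the declared width, which is the property I would expect as a user.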

To allow carrying unspecified columns forward, wouldn't it be better to add a special keyword later? For sparse schemas, I think it's a must, though I have no idea how to fit it into the current syntax. We should start thinking about it now.

Question: I'm still a bit confused about this bit. If I say:

A = LOAD 'myfile' AS (a: int, b: float);

Case 1: If the data file is not self-describing, "a" and "b" should match column 0 and column 1 respectively.
Case 2: If the data file is self-describing, will "a" and "b" match columns 0 and 1, or the columns named "a" and "b" in the file's own metadata?

Pi

On Tue, Jun 3, 2008 at 3:48 AM, Olga Natkovich <[EMAIL PROTECTED]> wrote:

> I in general agree with Pradeep that it is cleaner for the user to
> declare all fields rather than use half names and half positions.
> However, I could also see the case where the data has a very wide
> schema (say 25 columns) and the script uses the first 4 and then
> field 25. Forcing the user to declare all 25 fields seems excessive.
> I wonder if we should allow the schema to optionally include column
> positions - a sparse schema.
>
> Olga
>
> > -----Original Message-----
> > From: Pradeep Kamath [mailto:[EMAIL PROTECTED]
> > Sent: Monday, June 02, 2008 10:19 AM
> > To: [email protected]
> > Subject: RE: Does LOAD ... AS constitute a projection?
> >
> > I think the latter option of dropping fields more closely
> > matches user intent. Since the user gave a schema in the
> > load, it seems fair to assume that he is interested only in
> > the fields declared and hence expects to see only those
> > fields in output IMHO.
> >
> > -Pradeep
> >
> > -----Original Message-----
> > From: Alan Gates [mailto:[EMAIL PROTECTED]
> > Sent: Monday, June 02, 2008 9:59 AM
> > To: [email protected]
> > Subject: Does LOAD ... AS constitute a projection?
> >
> > This mail applies only to changes in the types branch.
> >
> > The central question is whether the declaration of fields in
> > LOAD ... AS constitutes a projection or not.
> >
> > With the changes in the types branch, we are allowing users to
> > declare types for fields in the load like this:
> >
> > A = LOAD 'myfile' AS (a: int, b:float);
> >
> > We would like to implement this as:
> >
> > A = LOAD 'myfile';
> > A' = FOREACH A generate (int)$0, (float)$1;
> >
> > and then let the optimizer push that conversion as far down
> > as possible, or completely remove it in cases where a declared
> > field is never used.
> >
> > But consider a pig latin script such as:
> >
> > A = LOAD 'myfile' AS (a: int, b: float);
> > B = FILTER A BY a > 0;
> > C = SORT B by a;
> > STORE C;
> >
> > What if a given tuple has 3 fields instead of 2? Is that
> > field anonymously carried along and stored as part of C? Or
> > does the AS in LOAD constitute a projection, so that it's
> > legal to lop off any fields past the second (b)?
> >
> > In favor of carrying it along is the argument that we
> > shouldn't force the user to declare all data in a file; maybe
> > he only wants to declare a few fields he needs to work with
> > but he still wants to store all the rest.
> >
> > In favor of lopping it is that the user told us about his
> > data, so we're justified in assuming that he described it
> > completely. It is also easier to implement this way, as it
> > allows us to make a set of optimization assumptions.
> >
> > Thoughts?
> >
> > Alan.
