I think that if users don't declare a field, it means they don't want it
(based on most of the tools I have used). Dropping undeclared fields is
already the least surprising behavior.

To allow carrying unspecified columns forward, wouldn't it be better to
add a special keyword later?
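
To make the two behaviors concrete, here is a rough Python model of them.
This is only a sketch, not Pig code: load_as and carry_rest are names made
up for illustration, with carry_rest standing in for whatever keyword we
might add.

```python
# Hypothetical model of the two LOAD ... AS semantics; not Pig code.
def load_as(row, schema, carry_rest=False):
    """Cast the declared leading fields; optionally keep the undeclared tail."""
    declared = [cast(value) for (_, cast), value in zip(schema, row)]
    # carry_rest stands in for the proposed keyword (an assumption).
    rest = list(row[len(schema):]) if carry_rest else []
    return declared + rest

schema = [("a", int), ("b", float)]
row = ["1", "2.5", "extra"]  # a tuple with 3 fields instead of 2

projected = load_as(row, schema)                 # AS acts as a projection
carried = load_as(row, schema, carry_rest=True)  # third field carried along
```

With the default, the third field is dropped; with the flag set, it is
carried along uncast.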

For sparse schemas, I think it's a must, though I have no idea how to fit
it into the current syntax. We should start thinking about it NOW.

Question:
I'm still a bit confused about one point. If I say:
A = LOAD 'myfile' AS (a: int, b: float);

Case 1: If the data file is not self-describing, "a" and "b" should match
column 0 and column 1 respectively.
Case 2: If the data file is self-describing, will "a" and "b" match columns
0 and 1, or the columns named "a" and "b" in the file's own metadata?
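
The difference between the two readings of Case 2 can be sketched in
Python. This is an illustration only, not Pig code: the function names and
the file_columns header are made up.

```python
# Hypothetical sketch of the two binding rules; not Pig code.
def bind_by_position(schema, row):
    """The i-th declared name binds to column i, ignoring file metadata."""
    return {name: cast(row[i]) for i, (name, cast) in enumerate(schema)}

def bind_by_name(schema, row, file_columns):
    """Each declared name binds to the file column with the same name."""
    return {name: cast(row[file_columns.index(name)]) for name, cast in schema}

schema = [("a", int), ("b", float)]
file_columns = ["b", "a"]  # made-up self-describing header, order swapped
row = ["3", "4"]

by_position = bind_by_position(schema, row)
by_name = bind_by_name(schema, row, file_columns)
```

When the file stores "b" before "a", the two rules give different answers
for the same declaration, which is why I think we need to pick one
explicitly.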

Pi


On Tue, Jun 3, 2008 at 3:48 AM, Olga Natkovich <[EMAIL PROTECTED]> wrote:

> I generally agree with Pradeep that it is cleaner for the user to
> declare all fields rather than use half names and half positions.
> However, I could also see the case where the data has a very wide schema
> (say 25 columns) and the script uses the first 4 fields and then field 25.
> Forcing the user to declare all 25 fields seems excessive. I wonder if
> we should allow the schema to optionally include column positions,
> i.e. a sparse schema.
>
> Olga
>
> > -----Original Message-----
> > From: Pradeep Kamath [mailto:[EMAIL PROTECTED]
> > Sent: Monday, June 02, 2008 10:19 AM
> > To: [email protected]
> > Subject: RE: Does LOAD ... AS constitute a projection?
> >
> > I think the latter option of dropping fields more closely
> > matches user intent. Since the user gave a schema in the
> > load, it seems fair to assume that he is interested only in
> > the fields declared and hence expects to see only those
> > fields in output IMHO.
> >
> > -Pradeep
> >
> > -----Original Message-----
> > From: Alan Gates [mailto:[EMAIL PROTECTED]
> > Sent: Monday, June 02, 2008 9:59 AM
> > To: [email protected]
> > Subject: Does LOAD ... AS constitute a projection?
> >
> > This mail applies only to changes in the types branch.
> >
> > The central question is whether the declaration of fields in
> > LOAD ... AS constitutes a projection or not.
> >
> > With the changes in the types branch, we are allowing users to
> > declare types for fields in the load like this:
> >
> > A = LOAD 'myfile' AS (a: int, b:float);
> >
> > We would like to implement this as:
> >
> > A = LOAD 'myfile';
> > A' = FOREACH A generate (int)$0, (float)$1;
> >
> > and then let the optimizer push that conversion as far down
> > as possible, or completely remove it in cases where a declared
> > field is never used.
> >
> > But consider a pig latin script such as:
> >
> > A = LOAD 'myfile' AS (a: int, b: float);
> > B = FILTER A BY a > 0;
> > C = SORT B by a;
> > STORE C;
> >
> > What if a given tuple has 3 fields instead of 2?  Is that
> > field anonymously carried along and stored as part of C?  Or
> > does the AS in LOAD constitute a projection, so that it's
> > legal to lop off any fields past the second (b)?
> >
> > In favor of carrying it along is the argument that we
> > shouldn't force the user to declare all data in a file, maybe
> > he only wants to declare a few fields he needs to work with
> > but he still wants to store all the rest.
> >
> > In favor of lopping it is that the user told us about his
> > data, we're justified in assuming that he described it
> > completely.  It is also easier to implement this way, as it
> > allows us to make a set of optimization assumptions.
> >
> > Thoughts?
> >
> > Alan.
> >
>
