pig-user  

Re: Question about semantics of "as" on the load statement

Chris Olston
Thu, 18 Sep 2008 18:44:55 -0700

Excellent point ... you've changed my mind -- I agree!

-Chris

On Sep 18, 2008, at 11:50 AM, Prashanth Pappu wrote:

I agree with the ideas in principle.

But projection during load makes code/functionality upgrades easier, and is
very desirable since
(a) Chaining of PIG jobs is very common
(b) Outputs of intermediate PIG jobs (which are used as input to other pig
jobs) are frequently changed to support new jobs

Here's a more descriptive version of the example. Consider version 1 and
version 2 of PIG/hadoop based jobs

Version 1 (P1-> P2):

Pig job P1
load 'p1-in' as (a,b);
... some processing
store  (a,b,c) into 'p1-out';

Pig job P2
load 'p1-out' as (a,b,c);
... some processing

Version 2 (P1->P2, P1->P3):

Pig job P1
load 'p1-in' as (a,b);
... some processing
store (a,b,c,d) into 'p1-out';

Pig job P2
-- same as version 1

Pig job P3
load 'p1-out' as (a,b,c,d);
... some processing

In developing version 2, note that currently all three scripts P1, P2, P3
have to be changed. In P2 specifically, the 'load' statement has to be
changed to use the new output schema of P1. But if 'load' were defined to
only load the first few fields defined in the load-statement, no changes
would have to be made for P2!

I have run into this problem many times before. And the issue is common in
databases too.

1. The dictum that "adding fields to an sql table will not break old sql queries" is very useful in upgrading the tables to include newer fields.

2. In PIG, if we can claim that "appending fields to a data file will not break old pig scripts" then it will solve many of the upgrade problems.

And in this context, it is useful to limit 'LOAD ... AS' to read in only
the first X fields of the raw log, where X is the number of fields in the
load statement.
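
For comparison, a script can already be insulated from appended fields by loading without a schema and naming the fields by position. This is only a minimal sketch, not the semantics proposed above; it reuses the P2 file and field names from the example, and the relation names A and B are illustrative:

```pig
-- P2: no "as" clause on the load, so the number of fields in 'p1-out'
-- does not have to match anything declared here
A = load 'p1-out';
-- name the first three fields by position; any trailing fields are
-- simply never referenced
B = foreach A generate $0 as a, $1 as b, $2 as c;
-- ... some processing on B
```

Written this way, P1 can go from storing (a,b,c) to storing (a,b,c,d) without a change to P2, at the cost of referring to fields by position in one place.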

Prashanth

On Thu, Sep 18, 2008 at 11:24 AM, Chris Olston <[EMAIL PROTECTED] inc.com> wrote:

I don't like the idea that there are two separate mechanisms to do
projection of unwanted fields.

I prefer:
* LOAD ... AS has to give the full schema (we can even consider enforcing
  this at run-time, if it's not too expensive ... and I suspect it's not)
* if you want to project you do FOREACH ... GENERATE <list of fields you
  want to retain>
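
Applied to a file with four fields of which a job needs only three, this preference might look like the following sketch. The file name 'p1-out' and the fields a, b, c, d are borrowed from the P1/P2 example in this thread, and the relation names A and B are illustrative:

```pig
-- declare the full schema of the file on load
A = load 'p1-out' as (a, b, c, d);
-- project explicitly, by name, down to the fields this job needs
B = foreach A generate a, b, c;
```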

Besides, the purpose of AS is to enable referring to fields by name rather than by position, but if you start using AS for projection then you're projecting by position (i.e., only retaining a K-prefix of the fields),
which seems yucky.

The downside to my approach is that if you have 100 fields but you only
want the first one, you have to tediously list them all in the LOAD
command, only to drop them right after. But in the long run the Pig project
intends to introduce stored schemas, and we envision that for data with
more than a handful of columns people will use stored schemas, and only use
on-the-fly schemas for very simple data sets for which stored schemas may
be overkill and exasperate users (e.g., a unary relation that simply lists
a bunch of companies; or a graph represented as a binary (source,
destination) relation).

-Chris



On Sep 17, 2008, at 9:24 PM, Prashanth Pappu wrote:

I think loading only the first column and throwing away the rest of the
data is better.

Here's my primary use-case:

I often chain pig-jobs. So say, p2 uses 'load' to consume the output of p1
(saved with 'store').
Now, if we want p1 to dump more fields that are useful for a third job p3,
currently, we're required to change p2's code (load statement
specifically).
But ideally, I just want to append the newer fields to p1's old schema and
have p2's load statement working without any changes.

Prashanth
On Wed, Sep 17, 2008 at 12:57 PM, Olga Natkovich <[EMAIL PROTECTED] inc.com> wrote:

 Hi,

If I ran the query below (and this is based on actual user query):

-- Note that data1 has more than 1 column but "as" only declares a single
-- one
A = load 'data1' as (x);
B = load 'data2' as (x, y, z);
C = JOIN A by x, B by x;
D = foreach C generate y,z;
store D into 'output';

the current pig implementation produces wrong results. The reason is that
currently load assumes that the complete schema is given to it. The
intention of the user was that (s)he only cares about the first column and
that the rest of the data can be thrown out. So in effect, the user is
treating "as" as a projection.

Do Pig users/developers have a strong opinion on how Pig should handle
this case? If so, please, provide use cases.

Thanks,

Olga


--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research



