pig-user  

Re: Question about semantics of "as" on the load statement

Prashanth Pappu
Thu, 18 Sep 2008 11:51:29 -0700

I agree with the ideas in principle.

But projection during load makes code and functionality upgrades easier, and
it is very desirable because
(a) Chaining of PIG jobs is very common
(b) Outputs of intermediate PIG jobs (which are used as input to other pig
jobs) are frequently changed to support new jobs

Here's a more descriptive version of the example. Consider version 1 and
version 2 of PIG/hadoop based jobs

Version 1 (P1-> P2):

Pig job P1
> load 'p1-in' as (a,b);
> ... some processing
> store  (a,b,c) into 'p1-out';

Pig job P2
> load 'p1-out' as (a,b,c);
>... some processing

Version 2 (P1->P2, P1->P3):

Pig job P1
> load 'p1-in' as (a,b);
> ... some processing
> store (a,b,c,d) into 'p1-out';

Pig job P2
-- same as version 1

Pig job P3
> load 'p1-out' as (a,b,c,d);
> ..some processing

In developing version 2, note that currently all three scripts are touched:
P1 is changed, P3 is new, and P2's 'load' statement has to be changed to
match the new output schema of P1, even though P2's logic is the same. But
if 'load' were defined to load only the first few fields, as many as are
named in the load statement, no changes would have to be made to P2!
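
To make the proposed semantics concrete, here is a minimal sketch in Python
rather than Pig Latin. It is only a simulation: load_as and the sample rows
are made up for illustration, and nothing here reflects how Pig's loader is
actually implemented.

```python
# Hypothetical simulation of "LOAD ... AS keeps only the first X fields".
# load_as and the sample data are illustrative assumptions, not Pig code.

def load_as(rows, names):
    """Keep the first len(names) fields of each row and attach the names."""
    k = len(names)
    loaded = []
    for row in rows:
        if len(row) < k:
            # A row shorter than the declared schema is treated as an error.
            raise ValueError(f"row has {len(row)} fields, schema needs {k}")
        loaded.append(dict(zip(names, row[:k])))
    return loaded


# 'p1-out' as stored by version 1 and version 2 of P1 (made-up rows).
p1_out_v1 = [(1, 2, 3), (4, 5, 6)]
p1_out_v2 = [(1, 2, 3, "x"), (4, 5, 6, "y")]

# P2's load statement stays the same across both versions.
p2_schema = ("a", "b", "c")
p2_on_v1 = load_as(p1_out_v1, p2_schema)
p2_on_v2 = load_as(p1_out_v2, p2_schema)

# P3 declares the extra field and therefore sees it.
p3_on_v2 = load_as(p1_out_v2, ("a", "b", "c", "d"))
```

Under this assumption P2 sees identical input before and after P1 starts
appending field d, while P3 picks up the new field by naming it.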

I have run into this problem many times before. And the issue is common in
databases too.

1. The dictum that "adding fields to an sql table will not break old sql
queries" is very useful in upgrading the tables to include newer fields.

2. In PIG, if we can claim that "appending fields to a data file will not
break old pig scripts" then it will solve many of the upgrade problems.

And in this context, it is useful to limit 'LOAD ... AS' to read in only the
first X fields of the raw log, where X is the number of fields in the load
statement.
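
The SQL dictum in point 1 can be checked with a small sketch, here using
Python's built-in sqlite3 module. The table name p1_out and the rows are
invented for the example. It only covers queries that name their columns;
something like SELECT * would of course see the new column.

```python
import sqlite3

# In-memory database standing in for a shared table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE p1_out (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("INSERT INTO p1_out VALUES (1, 2, 3)")

# An "old" query written against the original three-column table.
old_query = "SELECT a, b, c FROM p1_out ORDER BY a"
before = conn.execute(old_query).fetchall()

# Upgrade: append a new column and add a row that uses it.
conn.execute("ALTER TABLE p1_out ADD COLUMN d TEXT")
conn.execute("INSERT INTO p1_out VALUES (4, 5, 6, 'new')")

# The old query still runs unchanged and simply ignores column d.
after = conn.execute(old_query).fetchall()
conn.close()
```

The old query keeps working after the column is added, which is the
behaviour I would like appended fields in a data file to have for 'load'.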

Prashanth

On Thu, Sep 18, 2008 at 11:24 AM, Chris Olston <[EMAIL PROTECTED]> wrote:

> I don't like the idea that there are two separate mechanisms to do
> projection of unwanted fields.
>
> I prefer:
>  * LOAD ... AS has to give the full schema (we can even consider enforcing
> this at run-time, if it's not too expensive ... and I suspect it's not)
>  * if you want to project you do FOREACH ... GENERATE <list of fields you
> want to retain>
>
> Besides, the purpose of AS is to enable referring to fields by name rather
> than by position, but if you start using AS for projection then you're
> projecting by position (i.e., only retaining a K-prefix of the fields),
> which seems yucky.
>
> The downside to my approach is that if you have 100 fields but you only
> want the first one, you have to tediously list them all in the LOAD command,
> only to drop them right after. But in the long run the Pig project intends
> to introduce stored schemas, and we envision that for data with more than a
> handful of columns people will use stored schemas, and only use on-the-fly
> schemas for very simple data sets for which stored schemas may be overkill
> and exasperate users (e.g., a unary relation that simply lists a bunch of
> companies; or a graph represented as a binary (source, destination)
> relation).
>
> -Chris
>
>
>
> On Sep 17, 2008, at 9:24 PM, Prashanth Pappu wrote:
>
>> I think loading only the first column and throwing away the rest of the
>> data is better.
>>
>> Here's my primary use-case:
>>
>> I often chain pig-jobs. So say, p2 uses 'load' to consume the output of p1
>> (saved with 'store').
>> Now, if we want p1 to dump more fields that are useful for a third job p3,
>> currently, we're required to change p2's code (load statement
>> specifically).
>> But ideally, I just want to append the newer fields to p1's old schema and
>> have p2's load statement working without any changes.
>>
>> Prashanth
>> On Wed, Sep 17, 2008 at 12:57 PM, Olga Natkovich <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> If I ran the query below (and this is based on actual user query):
>>>
>>> -- Note that data1 has more than 1 column but "as" only declares a
>>> -- single one
>>> A = load 'data1' as (x);
>>> B = load 'data2' as (x, y, z);
>>> C = JOIN A by x, B by x;
>>> D = foreach C generate y,z;
>>> store D into 'output';
>>>
>>> the current pig implementation produces wrong results. The reason is
>>> that currently load assumes that the complete schema is given to it. The
>>> intention of the user was that (s)he only cares about the first column
>>> and the rest of the data could be thrown out. So in fact, "as" is treated
>>> as project.
>>>
>>> Do Pig users/developers have a strong opinion on how Pig should handle
>>> this case? If so, please, provide use cases.
>>>
>>> Thanks,
>>>
>>> Olga
>>>
>>>
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
>
>
>