pig-user  

Re: Question about semantics of "as" on the load statement

Mridul Muralidharan
Wed, 17 Sep 2008 23:32:55 -0700

Olga Natkovich wrote:
Hi,
If I ran the query below (and this is based on an actual user query):
-- Note that data1 has more than 1 column but "as" only declares a single one
A = load 'data1' as (x);
B = load 'data2' as (x, y, z);
C = JOIN A by x, B by x;
D = foreach C generate y,z;
store D into 'output';
the current Pig implementation produces wrong results. The reason is
that load currently assumes that the complete schema is given to it. The
intention of the user was that (s)he only cares about the first column,
as the rest of the data could be thrown out. So in fact, "as" is treated
as project.
Do Pig users/developers have a strong opinion on how Pig should handle
this case? If so, please, provide use cases.

If you look at the use cases enabled by each interpretation:

a) If the intention is to restrict the fields to what is specified in the schema, then a project following the load would do that for the user; the implicit project just does the same thing. So not supporting this requirement would not hamper expressibility or usability.

b) If the intention is to 'use' the fields specified in the schema within the script but leave the others as-is, to be propagated all the way to the output (which might be processed by some other program/script), then a restrictive load would make this use case near-impossible (unless users stop using schemas - not sure how pig2.0 behaves in that case).
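
As a rough illustration of case (a), the explicit project following the load could look something like the sketch below. It assumes the load is written without a schema so that every column is read, and that the wanted column is the first one (as in the quoted query); the alias A_raw is hypothetical and not part of the original script.

```pig
-- Sketch only: A_raw is a made-up alias, the rest follows the quoted query.
-- Load data1 without declaring a schema, so no assumption is made
-- about how many columns it has.
A_raw = load 'data1';
-- Explicit project: keep only the first column and name it x.
A = foreach A_raw generate $0 as x;
B = load 'data2' as (x, y, z);
C = JOIN A by x, B by x;
D = foreach C generate y, z;
store D into 'output';
```

With the project spelled out like this, the script should not depend on whether "as" on the load restricts the fields or not.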


Regards,
Mridul


Thanks, Olga