Chris Olston
Thu, 18 Sep 2008 18:44:55 -0700
Excellent point ... you've changed my mind -- I agree!

-Chris

On Sep 18, 2008, at 11:50 AM, Prashanth Pappu wrote:
I agree with the ideas in principle. But projection during load has code/functionality upgrade advantages. And is very desirable since

(a) Chaining of PIG jobs is very common
(b) Outputs of intermediate PIG jobs (which are used as input to other pig jobs) are frequently changed to support new jobs

Here's a more descriptive version of the example. Consider version 1 and version 2 of PIG/hadoop based jobs

Version 1 (P1->P2):

Pig job P1
    load 'p1-in' as (a,b);
    ... some processing
    store (a,b,c) into 'p1-out';

Pig job P2
    load 'p1-out' as (a,b,c);
    ... some processing

Version 2 (P1->P2, P1->P3):

Pig job P1
    load 'p1-in' as (a,b);
    ... some processing
    store (a,b,c,d) into 'p1-out';

Pig job P2 -- same as version 1

Pig job P3
    load 'p1-out' as (a,b,c,d);
    .. some processing

In developing version 2, note that currently all three scripts P1, P2, P3 have to be changed. In P2 specifically, the 'load' statement has to be changed to use the new output schema of P1. But if 'load' were defined to only load the first few fields defined in the load-statement, no changes have to be made for P2!

I have run into this problem many times before. And the issue is common in databases too.

1. The dictum that "adding fields to an sql table will not break old sql queries" is very useful in upgrading the tables to include newer fields.
2. In PIG, if we can claim that "appending fields to a data file will not break old pig scripts" then it will solve many of the upgrade problems.

And in this context, it is useful to limit 'LOAD ... AS' to read in only the first X fields of the raw log where X is the number of fields in the load statement.

Prashanth

On Thu, Sep 18, 2008 at 11:24 AM, Chris Olston <[EMAIL PROTECTED] inc.com> wrote:

I don't like the idea that there are two separate mechanisms to do projection of unwanted fields. I prefer:

* LOAD ... AS has to give the full schema (we can even consider enforcing this at run-time, if it's not too expensive ... and I suspect it's not)
* if you want to project you do FOREACH ...
GENERATE <list of fields you want to retain>

Besides, the purpose of AS is to enable referring to fields by name rather than by position, but if you start using AS for projection then you're projecting by position (i.e., only retaining a K-prefix of the fields), which seems yucky.

The downside to my approach is that if you have 100 fields but you only want the first one, you have to tediously list them all in the LOAD command, only to drop them right after. But in the long run the Pig project intends to introduce stored schemas, and we envision that for data with more than a handful of columns people will use stored schemas, and only use on-the-fly schemas for very simple data sets for which stored schemas may be overkill and exasperate users (e.g., a unary relation that simply lists a bunch of companies; or a graph represented as a binary (source, destination) relation).

-Chris

On Sep 17, 2008, at 9:24 PM, Prashanth Pappu wrote:

I think loading only the first column and throwing away the rest of the data is better. Here's my primary use-case:

I often chain pig-jobs. So say, p2 uses 'load' to consume the output of p1 (saved with 'store').

Now, if we want p1 to dump more fields that are useful for a third job p3, currently, we're required to change p2's code (load statement specifically).

But ideally, I just want to append the newer fields to p1's old schema and have p2's load statement working without any changes.

Prashanth

On Wed, Sep 17, 2008 at 12:57 PM, Olga Natkovich <[EMAIL PROTECTED] inc.com> wrote:

Hi,

If I ran the query below (and this is based on actual user query):

    -- Note that data1 has more than 1 column but as only declares a single one
    A = load 'data1' as (x);
    B = load 'data2' as (x, y, z);
    C = JOIN A by x, B by x;
    D = foreach C generate y,z;
    store D into 'output';

the current pig implementation produces wrong results. The reason is that currently load assumes that complete schema is given to it.
The intention of the user was that (s)he only cares about the first column as the rest of the data could be thrown out. So in fact, "as" is treated as project.

Do Pig users/developers have a strong opinion on how Pig should handle this case? If so, please, provide use cases.

Thanks,

Olga

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
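For reference, the alternative Chris first proposed in this thread (give the full schema in LOAD, then project explicitly with FOREACH ... GENERATE) might look roughly like this when applied to Olga's query. This is only a sketch under assumptions: the thread says only that data1 has more than one column, so the three-column layout and the field names extra1 and extra2 are hypothetical, as is the alias A1.

```
-- Sketch only: assumes data1 has exactly three columns.
-- extra1 and extra2 are hypothetical names for the unwanted fields.
A  = load 'data1' as (x, extra1, extra2);

-- Explicit projection: keep only the join key from data1.
A1 = foreach A generate x;

B  = load 'data2' as (x, y, z);
C  = JOIN A1 by x, B by x;

-- y and z exist only in B, so they are unambiguous after the join.
D  = foreach C generate y, z;
store D into 'output';
```

Under the prefix-projection reading that the thread ends up favoring, Olga's original script (with `load 'data1' as (x)`) would express the same intent without listing the extra fields.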