Chris Olston
Thu, 18 Sep 2008 11:26:06 -0700
I prefer:

* LOAD ... AS has to give the full schema (we can even consider enforcing this at run-time, if it's not too expensive ... and I suspect it's not)
* if you want to project you do FOREACH ... GENERATE <list of fields you want to retain>
Besides, the purpose of AS is to enable referring to fields by name rather than by position, but if you start using AS for projection then you're projecting by position (i.e., only retaining a K-prefix of the fields), which seems yucky.
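As a rough sketch of what this would look like for the query quoted further down in this message: the thread only says data1 has more than one column, so the three-column layout and the names extra1 and extra2 are assumptions made purely for illustration, as is the alias A1.

```pig
-- Assumption: data1 has three columns; extra1 and extra2 are hypothetical names.
-- LOAD ... AS declares the full schema of the file.
A = load 'data1' as (x, extra1, extra2);

-- Projection is explicit and by name, not by position.
A1 = foreach A generate x;

B = load 'data2' as (x, y, z);
C = JOIN A1 by x, B by x;
D = foreach C generate y, z;
store D into 'output';
```

Here the LOAD statement documents what is in the file, and the FOREACH ... GENERATE step is the only place where fields are dropped.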
The downside to my approach is that if you have 100 fields but you only want the first one, you have to tediously list them all in the LOAD command, only to drop them right after. But in the long run the Pig project intends to introduce stored schemas, and we envision that for data with more than a handful of columns people will use stored schemas, and only use on-the-fly schemas for very simple data sets for which stored schemas may be overkill and exasperate users (e.g., a unary relation that simply lists a bunch of companies; or a graph represented as a binary (source, destination) relation).
-Chris

On Sep 17, 2008, at 9:24 PM, Prashanth Pappu wrote:
I think loading only the first column and throwing away the rest of the data is better. Here's my primary use-case:

I often chain pig-jobs. So say, p2 uses 'load' to consume the output of p1 (saved with 'store').

Now, if we want p1 to dump more fields that are useful for a third job p3, currently, we're required to change p2's code (load statement specifically). But ideally, I just want to append the newer fields to p1's old schema and have p2's load statement working without any changes.

Prashanth

On Wed, Sep 17, 2008 at 12:57 PM, Olga Natkovich <[EMAIL PROTECTED] inc.com> wrote:

Hi,

If I ran the query below (and this is based on actual user query):

-- Note that data1 has more than 1 column but as only declares a single one
A = load 'data1' as (x);
B = load 'data2' as (x, y, z);
C = JOIN A by x, B by x;
D = foreach C generate y,z;
store D into 'output';

the current pig implementation produces wrong results. The reason is that currently load assumes that complete schema is given to it. The intention of the user was that (s)he only cares about the first column as the rest of the data could be thrown out. So in fact, "as" is treated as project.

Do Pig users/developers have a strong opinion on how Pig should handle this case? If so, please, provide use cases.

Thanks,
Olga
--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research