Prashanth Pappu
Wed, 17 Sep 2008 21:25:00 -0700
I think loading only the first column and throwing away the rest of the data is better.

Here's my primary use case: I often chain Pig jobs. Say p2 uses 'load' to consume the output of p1 (saved with 'store'). Now, if we want p1 to dump more fields that are useful for a third job, p3, we are currently required to change p2's code (specifically, its load statement). Ideally, I just want to append the newer fields to p1's old schema and have p2's load statement keep working without any changes.

Prashanth

On Wed, Sep 17, 2008 at 12:57 PM, Olga Natkovich <[EMAIL PROTECTED]> wrote:

> Hi,
>
> If I ran the query below (and this is based on actual user query):
>
> -- Note that data1 has more than 1 column but as only declares a single one
> A = load 'data1' as (x);
> B = load 'data2' as (x, y, z);
> C = JOIN A by x, B by x;
> D = foreach C generate y,z;
> store D into 'output';
>
> the current pig implementation produces wrong results. The reason is
> that currently load assumes that complete schema is given to it. The
> intention of the user was that (s)he only cares about the first column
> as the rest of the data could be thrown out. So in fact, "as" is treated
> as project.
>
> Do Pig users/developers have a strong opinion on how Pig should handle
> this case? If so, please, provide use cases.
>
> Thanks,
>
> Olga
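
For illustration, the job-chaining use case described at the top of this message might look like the sketch below. All relation names, field names, and paths ('events', 'p1_out', user, clicks, region) are hypothetical and not taken from the thread. The sketch assumes the proposed behavior, in which 'as' projects the leading columns and discards the rest; as noted in the quoted message, the implementation current at the time did not behave this way.

```pig
-- p1: originally stored (user, clicks); 'region' is appended later for p3.
-- All names and paths here are hypothetical.
raw = load 'events' as (user, clicks, region);
out = foreach raw generate user, clicks, region;
store out into 'p1_out';

-- p2: left unchanged. It declares only the leading fields it needs.
-- Under the proposed semantics the trailing 'region' column is dropped on load.
A = load 'p1_out' as (user, clicks);
B = foreach A generate user, clicks;
store B into 'p2_out';

-- p3: declares the full, extended schema.
C = load 'p1_out' as (user, clicks, region);
D = foreach C generate user, region;
store D into 'p3_out';
```

The scheme only works if new fields are appended at the end of p1's output, so that the positions of the existing fields never change.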