Mridul Muralidharan
Wed, 17 Sep 2008 23:32:55 -0700
Olga Natkovich wrote:
Hi,

If I ran the query below (and this is based on actual user query):

-- Note that data1 has more than 1 column but as only declares a single one
A = load 'data1' as (x);
B = load 'data2' as (x, y, z);
C = JOIN A by x, B by x;
D = foreach C generate y,z;
store D into 'output';

the current pig implementation produces wrong results. The reason is that currently load assumes that complete schema is given to it. The intention of the user was that (s)he only cares about the first column as the rest of the data could be thrown out. So in fact, "as" is treated as project.

Do Pig users/developers have a strong opinion on how Pig should handle this case? If so, please, provide use cases.
If you look at the use cases enabled by each:

a) If the intention is to restrict the fields to what is specified in the schema, then a project following the load would do that for the user; the implicit project is just doing the same. So not supporting this requirement would not hamper expressibility or usability.
b) If the intention is to 'use' the fields specified in the schema in the script, but leave the others as-is, to be propagated all the way to the output (which might be processed by some other program/script), then a restrictive load would make this use case near-impossible (unless users stop using schema - not sure how pig2.0 behaves in that case).
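To make the two cases concrete, here is a rough sketch in Pig Latin. The relation names, 'data3' and 'output_b' are made up for illustration, and the comments describe the semantics being discussed here (a load that keeps undeclared columns), not verified behaviour of any Pig release.

```pig
-- (a) Restrict to the declared fields with an explicit project after the load.
-- Assumes load keeps undeclared columns, so the foreach does the trimming.
A_raw = load 'data1' as (x);
A = foreach A_raw generate x;

-- (b) Name only the field the script uses and pass the rest through.
-- Assumes load keeps undeclared columns, so 'output_b' would carry every
-- original column; with a restrictive load it would contain only u.
E_raw = load 'data3' as (u);
E = filter E_raw by u is not null;
store E into 'output_b';
```

With a pass-through load, (a) costs the user one extra line, while with a restrictive load there is no way to write (b) short of dropping the schema.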
Regards, Mridul
Thanks, Olga