Philip (flip) Kromer created PIG-3919:
-----------------------------------------
Summary: Inner FOREACH of nested FOREACH should have access to
main-body aliases
Key: PIG-3919
URL: https://issues.apache.org/jira/browse/PIG-3919
Project: Pig
Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0
Reporter: Philip (flip) Kromer
Pig should allow values calculated in the main body of a FOREACH to be accessed
by an inner nested FOREACH.
{code}
top_queries = LOAD './test/data/pigunit/top_queries_input_data.txt' AS
(site:chararray, hits:int);
-- yahoo 10
-- twitter 7
-- ...
top_queries_g = GROUP top_queries BY site;
-- BREAKS: Invalid field projection. Projected field [top_queries] does not
-- exist in schema: site:chararray,hits:int. - org.apache.pig.tools.grunt.Grunt
cant_use_values_in_inner_foreach = FOREACH top_queries_g {
  n_sites = COUNT_STAR(top_queries);
  hits_x = FOREACH top_queries GENERATE hits / n_sites;
  GENERATE group AS site, n_sites, hits_x;
};
DUMP cant_use_values_in_inner_foreach;
{code}
This works, because n_sites behaves the same regardless of scope:
{code}
can_use_const_val = FOREACH top_queries_g {
  n_sites = 3;
  hits_x = FOREACH top_queries GENERATE hits / n_sites;
  GENERATE group AS site, n_sites, hits_x;
};
DUMP can_use_const_val;
{code}
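Until this is supported, one possible workaround is to attach the count to each row before dividing, then regroup. This is only a sketch: the aliases counted, scaled, regrouped and workaround are names made up for illustration, and it is untested against 0.12.0 and 0.13.0.
{code}
-- Hypothetical workaround: carry n_sites alongside every hits value
counted = FOREACH top_queries_g GENERATE
  group AS site,
  COUNT_STAR(top_queries) AS n_sites,
  FLATTEN(top_queries.hits) AS hits;
-- n_sites is now an ordinary field, so no nested scope is involved
scaled = FOREACH counted GENERATE site, n_sites, hits / n_sites AS hit_x;
-- Rebuild one row per site with a bag of the scaled values
regrouped = GROUP scaled BY (site, n_sites);
workaround = FOREACH regrouped GENERATE
  FLATTEN(group) AS (site, n_sites),
  scaled.hit_x AS hits_x;
DUMP workaround;
{code}
This costs a second GROUP, which is exactly what access to n_sites from the inner FOREACH would avoid.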
Pig handles the schema for the inner foreach in a very confusing way.
It should not allow statements in the main foreach body that reference fields
outside the main-body scope:
{code}
works_but_is_confusing = FOREACH top_queries_g {
  namelen_g = SIZE(group);
  namelen_s = SIZE(site); -- this should not work
  -- but it does, because namelen_s gains the right scope when evaluated
  hits_x = FOREACH top_queries GENERATE namelen_s * hits;
  -- instead, this should work, only evaluating namelen_g once
  -- hits_x = FOREACH top_queries GENERATE namelen_g * hits;
  -- if I used 'namelen_s' in this line, it would break.
  GENERATE group AS site, namelen_g, hits_x;
};
DUMP works_but_is_confusing;
{code}
Here, the inner foreach precedes the declaration of 'site' in the main body:
{code}
-- declaring main-body site _after_ the inner foreach doesn't interfere
alias_means_two_things = FOREACH top_queries_g {
  hits_x = FOREACH top_queries GENERATE SIZE(site)*hits; -- works
  site = CONCAT(group, group);
  namelen_s = SIZE(site);
  GENERATE site, namelen_s, hits_x;
};
DUMP alias_means_two_things;
{code}
Simply switching the order of the lines causes an error: the main-body
declaration of site hides the inner-bag alias. Also, the error is reported
against the main-body line, which is very confusing.
{code}
-- BREAKS
main_body_hides_alias = FOREACH top_queries_g {
  site = CONCAT(group, group); -- Projected field [group] does not exist
  -- in schema: site:chararray,hits:int
  namelen_s = SIZE(site);
  hits_x = FOREACH top_queries GENERATE SIZE(site)*hits;
  GENERATE site, namelen_s, hits_x;
};
DUMP main_body_hides_alias;
{code}