[
https://issues.apache.org/jira/browse/PIG-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237368#comment-13237368
]
Dmitriy V. Ryaboy commented on PIG-2610:
----------------------------------------
OK, so the Jira I *meant* to ask for wasn't about the GC error itself (the
workaround is to push the filter above the group), but about the fact that the
optimizer could do this rewrite automatically. It takes a little bit of
trickiness: the filters need to be turned into generates, and the counts into
sums.
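For reference, a rough sketch of what that manual rewrite might look like for
the script quoted below. This is only an illustration, not what the optimizer
would necessarily emit. The relation name "projected" and the columns
type1Clicks, type2Clicks and isType1 are made up for the example, and it
assumes clickCount is a long (add a cast if the loader does not declare a
schema).
{code}
raw = LOAD 'input' USING MyCustomLoader();
searches = FOREACH raw GENERATE
    day, searchType,
    FLATTEN(impBag) AS (adType, clickCount);
-- Each nested FILTER becomes a generated column, computed before the GROUP:
-- the click count when the row matches, 0 otherwise.
-- Assumption: clickCount is a long, so both bincond branches have one type.
projected = FOREACH searches GENERATE
    day, searchType, clickCount,
    ((adType == 'type1') ? clickCount : 0L) AS type1Clicks,
    ((adType == 'type2') ? clickCount : 0L) AS type2Clicks,
    -- Hypothetical 0/1 indicator: a COUNT over a filtered bag becomes a SUM.
    ((adType == 'type1') ? 1L : 0L) AS isType1;
groupedSearches = GROUP projected BY (day, searchType) PARALLEL 50;
counts = FOREACH groupedSearches GENERATE
    FLATTEN(group) AS (day, searchType),
    COUNT(projected) AS numSearches,
    SUM(projected.clickCount) AS clickCountPerSearchType,
    SUM(projected.type1Clicks) AS type1ClickCount,
    SUM(projected.type2Clicks) AS type2ClickCount,
    SUM(projected.isType1) AS type1Searches;
{code}
With no nested FILTER left, the final FOREACH contains only aggregates over
the grouped bag, which should let Pig use the combiner instead of
materializing each group's bag in the reducer. One behavioral difference to
keep in mind: a group with no 'type1' rows yields 0 here, whereas SUM over an
empty filtered bag yields null.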
> GC errors on using FILTER within nested FOREACH
> -----------------------------------------------
>
> Key: PIG-2610
> URL: https://issues.apache.org/jira/browse/PIG-2610
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.1
> Reporter: Prashant Kommireddi
>
> User has reported running into GC overhead errors while trying to use FILTER
> within FOREACH and aggregating the filtered field. Here is the sample
> PigLatin script provided by the user that generated this issue.
> {code}
> raw = LOAD 'input' USING MyCustomLoader();
> searches = FOREACH raw GENERATE
>     day, searchType,
>     FLATTEN(impBag) AS (adType, clickCount);
> groupedSearches = GROUP searches BY (day, searchType) PARALLEL 50;
> counts = FOREACH groupedSearches {
>     type1 = FILTER searches BY adType == 'type1';
>     type2 = FILTER searches BY adType == 'type2';
>     GENERATE
>         FLATTEN(group) AS (day, searchType),
>         COUNT(searches) AS numSearches,
>         SUM(searches.clickCount) AS clickCountPerSearchType,
>         SUM(type1.clickCount) AS type1ClickCount,
>         SUM(type2.clickCount) AS type2ClickCount;
> };
> {code}
> Pig should be able to handle this case.