[ 
https://issues.apache.org/jira/browse/PIG-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052304#comment-13052304
 ] 

Scott Carey commented on PIG-1631:
----------------------------------

{quote}
P.S: I understand that I could have alternatively, flattened the fields of B 
and then done a GROUP on page_id and then iterated through the records calling 
'SOMEUDF' appropriately but that would be 2 map-reduce operations AFAIK.
{quote}

What if the optimizer knew that an identical group immediately following such 
a flatten could be folded into one M/R pass?  (Does Pig already do this 
optimization?)

That said, a nested foreach is more intuitive and much more succinct than 
adding extra groups.
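
For reference, the two-step workaround quoted above might look roughly like 
the sketch below. It is only an illustration: the relation names D and E, the 
aliases click_rank and imp_rank, and the way SOMEUDF consumes its input are 
assumptions, not something taken from the issue.

{code}
-- Same inputs as in the issue; paths are elided there and left elided here.
A1 = LOAD '...' AS (page_id, rank); -- clicks
A2 = LOAD '...' AS (page_id, rank); -- impressions
B = COGROUP A1 BY page_id, A2 BY page_id;

-- Flattening both bags gives the per-page cross product of clicks and
-- impressions, i.e. one row per (click, impression) pair.
D = FOREACH B GENERATE group AS page_id,
                       FLATTEN(A1.rank) AS click_rank,
                       FLATTEN(A2.rank) AS imp_rank;

-- Regroup on the same key. Per the reporter (AFAIK), this is the step that
-- costs a second map-reduce job.
E = GROUP D BY page_id;

-- SOMEUDF is hypothetical; here it is assumed to take a bag of
-- (click_rank, imp_rank) tuples for one page.
C = FOREACH E GENERATE group AS page_id, SOMEUDF(D.(click_rank, imp_rank));
DUMP C;
{code}

The GROUP on page_id repeats the key already used by the COGROUP, which is 
the pattern the question above is about.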

> Support to 2 level nested foreach
> ---------------------------------
>
>                 Key: PIG-1631
>                 URL: https://issues.apache.org/jira/browse/PIG-1631
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.7.0
>            Reporter: Viraj Bhat
>            Assignee: Aniket Mokashi
>              Labels: gsoc2011
>             Fix For: 0.10
>
>
> What I would like to do is generate certain metrics (such as clicks on the 
> page) for every listing impression in the context of a page. So, I first 
> group to get clicks and impressions together. Next, I want to iterate 
> through the mini-table (one per serve-id) and compute the metrics. Since a 
> foreach nested within a foreach is not supported, I ended up writing a UDF 
> that took both bags and computed the metric. It would have been more elegant 
> to keep the logic of iterating over the records in the Pig script itself. 
> Here is some pseudocode of how I would have liked to write it:
> {code}
> -- Let us say in our page context there was a click on rank 2, for which
> -- there were 3 ads.
> A1 = LOAD '...' AS (page_id, rank); -- clicks
> A2 = LOAD '...' AS (page_id, rank); -- impressions
> B = COGROUP A1 BY (page_id), A2 BY (page_id);
> -- Let us say B has the following schema:
> -- (group, {(A1...)}, {(A2...)})
> -- Each record in B would be:
> -- page_id_1, {(page_id_1, 2)}, {(page_id_1, 1), (page_id_1, 2), (page_id_1, 3)}
> C = FOREACH B GENERATE {
>                 -- This won't work in current Pig either. Basically, I would
>                 -- like a mini-table which represents an entire serve.
>                 D = FLATTEN(A1), FLATTEN(A2);
>                 FOREACH D GENERATE
>                         page_id_1,
>                         A2::rank,
>                         -- This UDF returns a value (like v1, v2, v3)
>                         -- depending on A1::rank and A2::rank.
>                         SOMEUDF(A1::rank, A2::rank);
> };
> -- output
> -- page_id, 1, v1
> -- page_id, 2, v2
> -- page_id, 3, v3
> DUMP C;
> {code}
> P.S: I understand that I could have alternatively, flattened the fields of B 
> and then done a GROUP on page_id and then iterated through the records 
> calling 'SOMEUDF' appropriately but that would be 2 map-reduce operations 
> AFAIK. 
> This is a candidate project for Google Summer of Code 2011. More information 
> about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
