[ 
https://issues.apache.org/jira/browse/PIG-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1631:
--------------------------------

    Fix Version/s: 0.10

> Support to 2 level nested foreach
> ---------------------------------
>
>                 Key: PIG-1631
>                 URL: https://issues.apache.org/jira/browse/PIG-1631
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.7.0
>            Reporter: Viraj Bhat
>             Fix For: 0.10
>
>
> I would like to generate certain metrics for every listing impression in the 
> context of a page, such as clicks on the page. I first group by page to bring 
> clicks and impressions together. I then want to iterate through the resulting 
> mini-table (one per serve-id) and compute the metrics. Since a foreach nested 
> within a foreach is not supported, I ended up writing a UDF that takes both 
> bags and computes the metric. It would have been more elegant to keep the 
> logic of iterating over the records in the Pig script itself. 
> Here is some pseudocode of how I would have liked to write it:
> {code}
> -- Let us say that in our page context there was a click on rank 2, for which 
> -- there were 3 ads.
> A1 = LOAD '...' AS (page_id, rank); -- clicks. 
> A2 = LOAD '...' AS (page_id, rank); -- impressions
> B = COGROUP A1 by (page_id), A2 by (page_id); 
> -- Let us say B has the following schema:
> -- (group, {(A1...)}, {(A2...)})
> -- Each record in B would look like:
> -- (page_id_1, {(page_id_1, 2)}, {(page_id_1, 1), (page_id_1, 2), (page_id_1, 3)})
> C = FOREACH B GENERATE {
>                 D = FLATTEN(A1), FLATTEN(A2); -- This won't work in current 
>                 -- Pig either. Basically, I would like a mini-table that 
>                 -- represents an entire serve.
>                 FOREACH D GENERATE
>                         page_id_1,
>                         A2::rank,
>                         SOMEUDF(A1::rank, A2::rank);  -- This UDF returns a 
>                         -- value (like v1, v2, v3) depending on A1::rank and 
>                         -- A2::rank.
> };
> -- Expected output:
> -- page_id, 1, v1
> -- page_id, 2, v2
> -- page_id, 3, v3
> DUMP C;
> {code}
> P.S.: I understand that I could alternatively have flattened the fields of B, 
> done a GROUP on page_id, and then iterated through the records, calling 
> 'SOMEUDF' as appropriate, but that would take 2 map-reduce jobs, as far as I 
> know. 
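
To make the requested semantics concrete, the following is a minimal plain-Python sketch of what the pseudocode above appears to ask for: per page, pair every click with every impression and apply a UDF to each pair. It is not Pig code and says nothing about how Pig would execute it. The names `cogroup`, `nested_foreach`, and `some_udf` are hypothetical, and the "hit"/"miss" return values are an assumption standing in for the unspecified v1, v2, v3.

```python
from itertools import product


def some_udf(click_rank, impression_rank):
    # Hypothetical stand-in for SOMEUDF; the issue does not define its logic.
    return "hit" if click_rank == impression_rank else "miss"


def cogroup(clicks, impressions):
    # Mimics COGROUP A1 BY page_id, A2 BY page_id:
    # page_id -> (bag of click rows, bag of impression rows).
    groups = {}
    for row in clicks:
        groups.setdefault(row[0], ([], []))[0].append(row)
    for row in impressions:
        groups.setdefault(row[0], ([], []))[1].append(row)
    return groups


def nested_foreach(groups):
    # For each group, cross the two bags (the "mini-table" D) and
    # emit one output row per (click, impression) pair.
    out = []
    for page_id, (a1, a2) in groups.items():
        for (_, click_rank), (_, imp_rank) in product(a1, a2):
            out.append((page_id, imp_rank, some_udf(click_rank, imp_rank)))
    return out


clicks = [("page_id_1", 2)]
impressions = [("page_id_1", 1), ("page_id_1", 2), ("page_id_1", 3)]
result = nested_foreach(cogroup(clicks, impressions))
```

In this model a page with impressions but no clicks produces no output rows, because the cross product with an empty bag is empty.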

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
