Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by GuntherHagleitner:
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

------------------------------------------------------------------------------
  ==== Map-only Splittees ====

  If a splittee is a map-only job (doesn't require join, cogroup, group, etc.) the splittee is merged into
- the splitter - into either the map or reduce plan.
+ the splitter - into either the map or reduce plan. The use-case of storing temporary results during execution
+ falls into this category.

  The script:

@@ -335, +336 @@

  attachment:map-only.png

+ The same works in the reducer. The script:
+
+ {{{
+ A = load '/user/pig/tests/data/pigmix/page_views'
+     as (user, action, timespent, query_term, ip_addr, timestamp,
+         estimated_revenue, page_info, page_links);
+ B = group A by user;
+ C = foreach B generate group, MIN(A.timespent);
+ D = foreach B generate group, MAX(A.timespent);
+ store C into 'min_timespent';
+ store D into 'max_timespent';
+ }}}
+
+ Will be executed as:
+
+ attachment:reduce-only.png
+
+ [[Anchor(Map_reduce_splittee)]]
+ ==== Map-reduce Splittees ====
+
+ If a split happens in the map plan and one of the splittees is a map-(combine)-reduce job, the
+ map plan will be a combined plan of all the splittee and splitter map plans and the reduce job
+ will be the one of the map-(combine)-reduce job.
+
+ The script:
+
+ {{{
+ A = load '/user/pig/tests/data/pigmix/page_views'
+     as (user, action, timespent, query_term, ip_addr, timestamp,
+         estimated_revenue, page_info, page_links);
+ B = filter A by user is not null;
+ store B into 'filtered_user';
+ C = group B by action;
+ D = foreach C generate B.action, COUNT(B);
+ store D into 'count';
+ }}}
+
+ Will be executed as:
+
+ attachment:map-mapreduce.png
+
+ In a similar way, if multiple splittees are map-(combine)-reduce jobs the combine and reduce
+ plans are also merged.
+
+ The script:
+
+ {{{
+ A = load '/user/pig/tests/data/pigmix/page_views'
+     as (user, action, timespent, query_term, ip_addr, timestamp,
+         estimated_revenue, page_info, page_links);
+ B = group A by user;
+ C = foreach B generate A.user, MAX(A.estimated_revenue);
+ store C into 'highest_values';
+ D = group A by query_term;
+ E = foreach D generate group, SUM(A.timespent);
+ store E into 'total_time';
+ }}}
+
+ Will be executed as:
+
+ attachment:mapreduce.png
+
  [[Anchor(Phases)]]
  == Phases ==

@@ -465, +528 @@

  If we multiplex outputs from different split branches we have to decide what to do with the
  requested parallelism: Max, sum or average?

- [[Anchor(Diamond_problem_(Phase_3))]]
- ==== Diamond problem (Phase 3) ====
- What happens when different split plans come back together?
-
- Should come for free. Need to make sure unions can handle multiple split branches.
-