[Pig Wiki] Update of "PigMultiQueryPerformanceSpecification" by GuntherHagleitner

Apache Wiki Fri, 10 Apr 2009 00:58:51 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by GuntherHagleitner:
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

------------------------------------------------------------------------------
  ==== Map-only Splittees ====
  
  If a splittee is a map-only job (doesn't require join, cogroup, group, etc) 
the splittee is merged into
- the splitter - into either the map or reduce plan.
+ the splitter - into either the map or reduce plan. The use-case of storing 
temporary results during execution
+ falls into this category.
  
  The script:
  
@@ -335, +336 @@

  
  attachment:map-only.png
  
+ The same works in the reducer. The script:
+ 
+ {{{
+ A = load '/user/pig/tests/data/pigmix/page_views'
+     as (user, action, timespent, query_term, ip_addr, timestamp,
+         estimated_revenue, page_info, page_links);
+ B = group A by user;
+ C = foreach B generate group, MIN(A.timespent);
+ D = foreach B generate group, MAX(A.timespent);
+ store C into 'min_timespent';
+ store D into 'max_timespent';
+ }}}
+ 
+ Will be executed as:
+ 
+ attachment:reduce-only.png
+ 
+ [[Anchor(Map_reduce_splittee)]]
+ ==== Map-reduce Splittees ====
+ 
+ If a split happens in the map plan and one of the splittees is a 
map-(combine)-reduce job, the
+ map plan will be a combined plan of all the splittee and splitter map plans, 
and the reduce plan
+ will be the one from the map-(combine)-reduce job.
+ 
+ The script:
+ 
+ {{{
+ A = load '/user/pig/tests/data/pigmix/page_views'
+     as (user, action, timespent, query_term, ip_addr, timestamp,
+         estimated_revenue, page_info, page_links);
+ B = filter A by user is not null;
+ store B into 'filtered_user';
+ C = group B by action;
+ D = foreach C generate group, COUNT(B);
+ store D into 'count';
+ }}}
+ 
+ Will be executed as:
+ 
+ attachment:map-mapreduce.png
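
As a rough illustration of the merged plan described above, the following Python sketch models a single pass over the input that both writes the map-only output and feeds the reduce side. It is a simplified model under stated assumptions, not Pig's implementation: the sample rows and the name `run_merged_job` are hypothetical, and an in-memory counter stands in for the shuffle and reduce phases.

```python
from collections import Counter

# Hypothetical sample input: (user, action) pairs standing in for page_views.
ROWS = [
    ("alice", "click"),
    ("bob", "view"),
    (None, "click"),
    ("carol", "click"),
]


def run_merged_job(rows):
    """Model one merged job: a single scan serves both splittees."""
    filtered_user = []  # stands in for "store B into 'filtered_user'" (map side)
    counts = Counter()  # stands in for the shuffle plus the reduce-side COUNT
    for user, action in rows:
        if user is None:
            continue  # B = filter A by user is not null
        filtered_user.append((user, action))  # map-only splittee output
        counts[action] += 1  # record handed on to the group-by-action branch
    return filtered_user, dict(counts)


filtered_user, count = run_merged_job(ROWS)
```

The point of the sketch is that the input is read once, even though two outputs are produced.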
+ 
+ In a similar way, if multiple splittees are map-(combine)-reduce jobs, the 
combine and reduce
+ plans are also merged.
+ 
+ The script:
+ 
+ {{{
+ A = load '/user/pig/tests/data/pigmix/page_views'
+     as (user, action, timespent, query_term, ip_addr, timestamp,
+         estimated_revenue, page_info, page_links);
+ B = group A by user;
+ C = foreach B generate group, MAX(A.estimated_revenue);
+ store C into 'highest_values';
+ D = group A by query_term;
+ E = foreach D generate group, SUM(A.timespent);
+ store E into 'total_time';
+ }}}
+ 
+ Will be executed as:
+ 
+ attachment:mapreduce.png
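
One way to picture merged reduce plans is to tag each map output with the index of the split branch it belongs to and let the reducer dispatch on that tag. The Python sketch below is a simplified model under that assumption; the tagging scheme, the sample rows, and the names `merged_map`, `REDUCERS`, and `run` are hypothetical and are not taken from the Pig source.

```python
from collections import defaultdict

# Hypothetical sample rows: (user, query_term, timespent, estimated_revenue).
ROWS = [
    ("alice", "pig", 10, 1.5),
    ("alice", "hadoop", 5, 4.0),
    ("bob", "pig", 7, 2.5),
]


def merged_map(row):
    """Merged map plan: emit one tagged record per split branch."""
    user, query_term, timespent, revenue = row
    yield (0, user), revenue          # branch 0: group by user, MAX(revenue)
    yield (1, query_term), timespent  # branch 1: group by query_term, SUM(timespent)


# Merged reduce plan: one aggregation per branch, selected by the tag.
REDUCERS = {0: max, 1: sum}


def run(rows):
    shuffled = defaultdict(list)  # stands in for the shuffle phase
    for row in rows:
        for tagged_key, value in merged_map(row):
            shuffled[tagged_key].append(value)
    outputs = {branch: {} for branch in REDUCERS}
    for (branch, key), values in shuffled.items():
        outputs[branch][key] = REDUCERS[branch](values)
    return outputs


outputs = run(ROWS)
highest_values, total_time = outputs[0], outputs[1]
```

Both aggregations are computed from one scan and one shuffle, which is the saving the merge aims for.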
+ 
  [[Anchor(Phases)]]
  == Phases ==
  
@@ -465, +528 @@

  
  If we multiplex outputs from different split branches, we have to decide what 
to do with the requested parallelism: max, sum, or average?
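
To make the three candidates concrete, here is a small Python sketch that computes each one for the parallelism requested by the individual branches. It does not settle the question; the function name `merged_parallelism` is hypothetical, and rounding the average up is an assumption of this sketch.

```python
import math


def merged_parallelism(requested, strategy="max"):
    """Combine the parallelism requested by each split branch.

    requested: non-empty list of positive integers, one per branch.
    strategy:  "max", "sum", or "average" (rounded up; an assumption here).
    """
    if not requested:
        raise ValueError("at least one branch is required")
    if strategy == "max":
        return max(requested)
    if strategy == "sum":
        return sum(requested)
    if strategy == "average":
        return math.ceil(sum(requested) / len(requested))
    raise ValueError("unknown strategy: " + strategy)


# Two branches asking for 10 and 40 reducers respectively.
candidates = {s: merged_parallelism([10, 40], s) for s in ("max", "sum", "average")}
```

For branches requesting 10 and 40 reducers, the candidates are 40 (max), 50 (sum), and 25 (average).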
  
- [[Anchor(Diamond_problem_(Phase_3))]]
- ==== Diamond problem (Phase 3) ====
- What happens when different split plans come back together?
- 
- Should come for free. Need to make sure unions can handle multiple split 
branches.
-
