Chris Olston
Thu, 19 Jun 2008 12:31:29 -0700
Prashanth,

You can write it as a single group-by program, using a custom function to assign tuples to groups (i.e., if x==1, it assigns the tuple to a first group; if y==1, it assigns it to a second group, and so on). If a single tuple needs to be placed into multiple groups, the function can output multiple groups for a single input.
It would look like this:

    a = load ...;
    b = foreach a generate flatten(my_group_func(*));
    c = group b by $0;
    d = foreach c generate group, COUNT(b);

-Chris

On Jun 19, 2008, at 12:20 PM, Prashanth Pappu wrote:
I have a PIG script that simply generates a lot of 'counts' over very large data. For example,

    a = load 'data' as (x,y,z);
    b1 = filter a by x==1;
    b1_group = group b1 all;
    b1_count = foreach b1_group generate COUNT(b1);
    b2 = filter a by y==1;
    b2_group = group b2 all;
    b2_count = foreach b2_group generate COUNT(b2);
    ...etc

Suppose that we need to generate counts b1 to b1000. Now, PIG generates 1000 different hadoop jobs (one for each count). While each job finishes fast enough, the per-job overhead considerably slows down the script. So, I have two questions:

(a) If I want to generate many counts by simply filtering the rows of the data, is there a better way to code this script?

(b) Are there any PIG optimizations (current or planned) that will cause PIG to generate fewer jobs? Because, clearly, one can write a single java map-reduce job to accomplish the task.

Thanks,
Prashanth
-- Christopher Olston, Ph.D. Sr. Research Scientist Yahoo! Research
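
To make the single-pass idea in the reply concrete, here is a minimal Python sketch of the logic a function like my_group_func might implement. It is not Pig code and does not use Pig's UDF API. The PREDICATES table, the group labels, and the sample rows are all hypothetical, chosen only to mirror the x==1 and y==1 filters from the question.

```python
from collections import Counter

# Hypothetical predicates standing in for the filters b1, b2, ...
# Each maps a group label to a test on one (x, y, z) tuple.
PREDICATES = {
    "x==1": lambda row: row[0] == 1,
    "y==1": lambda row: row[1] == 1,
}


def my_group_func(row):
    """Return every group label whose predicate the row satisfies.

    A row may match zero, one, or several predicates, so the result
    is a list; this plays the role of the bag that flatten() expands.
    """
    return [label for label, test in PREDICATES.items() if test(row)]


def count_groups(rows):
    """Scan the data once, emitting labels per row and counting per label."""
    counts = Counter()
    for row in rows:
        counts.update(my_group_func(row))
    return counts


# Hypothetical sample data in (x, y, z) form.
rows = [(1, 0, 5), (1, 1, 7), (0, 1, 2), (0, 0, 9)]
counts = count_groups(rows)
```

The point of the sketch is that all of the counts come out of one scan plus one group-and-count step, rather than one filter and one job per count. A label that no row matches simply never appears among the groups.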