Utkarsh Srivastava
Thu, 19 Jun 2008 12:53:04 -0700
You can write a function myFunc that outputs, for a particular record, which of the counts b1 .. b1000 it contributes to. (A record could even contribute to more than one count, in which case myFunc() should be an EvalFunc<DataBag>.) Then:

    A = load 'data';
    B = group A by flatten(myFunc(*));
    C = foreach B generate group, COUNT(A);

Utkarsh

-----Original Message-----
From: [EMAIL PROTECTED] [EMAIL PROTECTED] On Behalf Of Prashanth Pappu
Sent: Thursday, June 19, 2008 12:20 PM
To: pig-user@incubator.apache.org
Subject: Performance/coding question

I have a PIG script that simply generates a lot of 'counts' over very large data. For example,

    a = load 'data' as (x,y,z);
    b1 = filter a by x==1;
    b1_group = group b1 all;
    b1_count = foreach b1_group generate COUNT(b1);
    b2 = filter a by y==1;
    b2_group = group b2 all;
    b2_count = foreach b2_group generate COUNT(b2);
    ...etc

Suppose that we need to generate counts b1 to b1000. Now, PIG generates 1000 different Hadoop jobs (one for each count). While each job finishes fast enough, the per-job overhead considerably slows down the script.

So, I have two questions:

(a) If I want to generate many counts by simply filtering the rows of the data, is there a better way to code this script?

(b) Are there any PIG optimizations (current or planned) that will cause PIG to generate fewer jobs? Clearly, one can write a single Java map-reduce job to accomplish the task.

Thanks,
Prashanth
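
To make the single-pass idea from the reply concrete, here is a minimal sketch in plain Python (not Pig code). It assumes each count is defined by one predicate over (x, y, z). The names PREDICATES, my_func, and count_all are hypothetical and introduced only for illustration; my_func plays the role of myFunc, returning every count label a record contributes to, and the counting loop stands in for the group-and-count step.

```python
from collections import Counter

# Hypothetical predicates, one per count. Only b1 and b2 are taken from
# the question; a real script would list all of b1 .. b1000 here.
PREDICATES = {
    "b1": lambda x, y, z: x == 1,
    "b2": lambda x, y, z: y == 1,
}


def my_func(record):
    """Return the labels of every count this record contributes to.

    A record may match zero, one, or several predicates, which is why
    the reply suggests a bag-valued function.
    """
    x, y, z = record
    return [name for name, pred in PREDICATES.items() if pred(x, y, z)]


def count_all(records):
    """Make one pass over the data and count records per label."""
    counts = Counter()
    for record in records:
        # Equivalent to flattening the labels and grouping by label.
        counts.update(my_func(record))
    return counts


if __name__ == "__main__":
    data = [(1, 0, 5), (1, 1, 7), (0, 1, 2), (0, 0, 9)]
    print(dict(count_all(data)))
```

The point of the design is that the data is scanned once no matter how many counts are defined, instead of once per filter.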