You can also do this with one pig script:

A = load 'input.txt' USING PigStorage() as (id:chararray, value:long);
B = GROUP A all;
C = FOREACH B generate SUM(A.$1) as total;
D = CROSS A, C;
E = FOREACH D GENERATE A::id, ((double)A::value) / ((double)C::total);
STORE E into 'final_output.txt';
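For the sample data in the question below (a 3, b 10, c 7) the sum is 20, so, assuming tab-separated input (the PigStorage default), final_output.txt should come out roughly as:

a 0.15
b 0.5
c 0.35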
Thanks,
-Richard

-----Original Message-----
From: Mridul Muralidharan [mailto:[email protected]]
Sent: Wednesday, January 06, 2010 1:15 PM
To: [email protected]
Subject: Re: convert raw metrics to percentages

Off the top of my head, I can see a GROUP ALL which does the sum and stores it in HDFS, and a second job which reads this as a side file in a UDF to divide by it (without multi-query optimization enabled).

You could also use scripting to pass the sum as a param if you are OK with splitting it into two scripts.

To calculate the sum - first_script.pig:

A = load 'input.txt' USING PigStorage() as (id:chararray, value:long);
B = GROUP A all;
C = FOREACH B generate SUM(A.value);
STORE C into 'output.path' USING PigStorage();

If using a script, just do:

pig -param DIVIDEBY=`hadoop dfs -cat output.path/*` second_script.pig

second_script.pig:

A = load 'input.txt' USING PigStorage() as (id:chararray, value:long);
B = FOREACH A GENERATE id, ((double)value) / ((double)$DIVIDEBY);
STORE B into 'final_output.txt';

Regards,
Mridul

Xiaomeng Wan wrote:
> Hi everyone, this should be a simple task, but I couldn't find an efficient
> way to do it. I have a relation that looks like:
>
> a 3
> b 10
> c 7
>
> I want to convert the raw metrics into percentages. The expected relation is
>
> a 0.15
> b 0.5
> c 0.35
>
> What is the best way to pass the computed total into the GENERATE
> clause for the result relation?
>
> Regards,
> Shawn
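To make the two-script route concrete (a sketch, assuming the same sample data a 3, b 10, c 7 and the file names from the mail above): first_script.pig writes the single value 20 into output.path, so the invocation expands to

pig -param DIVIDEBY=20 second_script.pig

and, since -param substitution is textual, the FOREACH in second_script.pig effectively becomes

B = FOREACH A GENERATE id, ((double)value) / ((double)20);

which yields the same a 0.15, b 0.5, c 0.35 output as the single-script version higher up in the thread.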
