In general, I try to avoid CROSS if possible - that is, when it is simple enough to do so ... and it is trivial in this case.

Mridul

Richard Ding wrote:
You can also do this with one pig script:

A = load 'input.txt' USING PigStorage() as (id:chararray, value:long);
B = GROUP A all;
C = FOREACH B generate SUM(A.$1) as total;
D = CROSS A, C;
E = FOREACH D GENERATE A::id, ((double)A::value) / ((double)C::total);
STORE E into 'final_output.txt';
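
For Shawn's sample input below (total = 20), this should produce something like:

a 0.15
b 0.5
c 0.35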

Thanks,
-Richard

-----Original Message-----
From: Mridul Muralidharan [mailto:[email protected]]
Sent: Wednesday, January 06, 2010 1:15 PM
To: [email protected]
Subject: Re: convert raw metrics to percentages


Off the top of my head: a GROUP ... ALL which computes the sum and stores it in HDFS, and a second job which reads that output as a side file in a UDF to do the division (without multi-query optimization enabled). A rough sketch of such a UDF is below.
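
Something along these lines, for example (untested sketch - the class name, package and side-file path are made up; it assumes the sum has already been stored by the group-all job at a known HDFS path):

package example;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Illustrative UDF only - reads the precomputed total from an HDFS side file
// and divides each input value by it.
public class DivideByTotal extends EvalFunc<Double> {

    private final String totalPath;  // HDFS path of the file written by the group-all job
    private Double total;            // loaded lazily, once per task

    // Pig passes constructor arguments through DEFINE, e.g.
    //   DEFINE DIVIDE example.DivideByTotal('output.path/part-00000');
    public DivideByTotal(String totalPath) {
        this.totalPath = totalPath;
    }

    @Override
    public Double exec(Tuple input) throws IOException {
        if (input == null || input.get(0) == null) {
            return null;
        }
        if (total == null) {
            total = readTotal();
        }
        long value = ((Number) input.get(0)).longValue();
        return value / total;
    }

    // Read the single-line sum stored by the first job.
    private Double readTotal() throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(new Path(totalPath))));
        try {
            return Double.parseDouble(reader.readLine().trim());
        } finally {
            reader.close();
        }
    }
}

After REGISTERing the jar and adding the DEFINE shown in the comment, the second relation would just GENERATE id, DIVIDE(value).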
You could also use scripting to pass the sum as a param if you are ok with splitting it into two scripts.

To calculate sum -
first_script.pig:
A = load 'input.txt' USING PigStorage() as (id:chararray, value:long);
B = GROUP A all;
C = FOREACH B generate SUM(A.value);
STORE C into 'output.path' USING PigStorage();
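
For Shawn's sample data, output.path should end up containing a single value, 20.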

If using a script, just do:
pig -param DIVIDEBY=`hadoop dfs -cat output.path/*` second_script.pig
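
The backticks substitute the contents of the stored sum, so for the sample data this effectively runs:

pig -param DIVIDEBY=20 second_script.pig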

second_script.pig:
A = load 'input.txt' USING PigStorage() as (id:chararray, value:long);
B = FOREACH A GENERATE id, ((double)value) / ((double)$DIVIDEBY);
STORE B into 'final_output.txt';
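
For the sample input this should again give a 0.15, b 0.5 and c 0.35, same as the single-script version above.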




Regards,
Mridul


Xiaomeng Wan wrote:
Hi everyone, this should be a simple task, but I couldn't find an efficient
way to do it. I have a relation that looks like:

a 3
b 10
c 7

I want to convert the raw metrics into percentages. The expected
relation is:
a 0.15
b 0.5
c 0.35

what is the best way to pass the computed total into the generate
clause for the result relation?

Regards,
Shawn

