some speed wrote:
Thanks for the response. What I am trying to do is find the average
and then the standard deviation of a very large set (say a million) of
numbers. The result would be used in further calculations.
I have got the average from the first map-reduce chain. Now I need to read
this average as well as the set of numbers to calculate the standard
deviation, so one file would have the input set and the other "resultant"
file would have just the average.
Please do tell me if there is a better way of doing things than what I
am doing. Any input/suggestion is appreciated. :)
std_dev^2 = sum_i((Xi - Xa)^2) / N, where Xa is the avg.
Why don't you use the expanded form of the formula to compute it in one MR job:
std_dev^2 = (sum_i(Xi^2) - N * Xa^2) / N
          = (A - N * avg^2) / N
For this your map would look like
map(key, val) : output.collect(key * key, key); // imagine your input as
(k, v) = (Xi, null)
The reducer should simply sum over the keys to find 'sum_i(Xi^2)' and
sum over the values to find 'sum_i(Xi)', from which Xa = sum_i(Xi) / N.
You could use the close() API to finally dump these 2 values to a file.
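The idea above can be sketched locally without Hadoop. This is not real Hadoop code (the class, field, and method names are made up for illustration); map() stands in for the mapper emitting (Xi^2, Xi), and reduce() stands in for the reducer accumulating the two sums that close() would then dump:

```java
// Local, non-Hadoop sketch of the single-pass scheme: the mapper emits
// (Xi * Xi, Xi) and the reducer keeps running sums of keys and values.
public class StdDevSketch {
    static double sumOfSquares = 0; // A = sum_i(Xi^2), summed over the keys
    static double sum = 0;          // B = sum_i(Xi),   summed over the values
    static long count = 0;          // N

    // Stand-in for map(key, val): output.collect(Xi * Xi, Xi)
    static double[] map(double xi) {
        return new double[] { xi * xi, xi };
    }

    // Stand-in for the reducer: accumulate the emitted key and value.
    static void reduce(double key, double value) {
        sumOfSquares += key;
        sum += value;
        count++;
    }

    public static void main(String[] args) {
        for (double xi : new double[] { 1, 2, 3, 4 }) {
            double[] kv = map(xi);
            reduce(kv[0], kv[1]);
        }
        // What close() would dump: A and B. The rest is done offline.
        double avg = sum / count;
        double stdDev = Math.sqrt((sumOfSquares - count * avg * avg) / count);
        System.out.println("A=" + sumOfSquares + " B=" + sum
                + " stddev=" + stdDev);
    }
}
```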
For example:
input : 1, 2, 3, 4
Say the input is split into 2 groups, [1, 2] and [3, 4].
Now there will be 2 maps with output as follows:
map1 output : (1, 1) (4, 2)
map2 output : (9, 3) (16, 4)
The reducer will maintain the sum over all keys and all values:
A = sum(keys, i.e. input squared) = 1 + 4 + 9 + 16 = 30
B = sum(values, i.e. input) = 1 + 2 + 3 + 4 = 10
With A and B you can compute the standard deviation offline.
So avg = B / N = 10 / 4 = 2.5
Hence the std deviation would be
sqrt((A - N * avg^2) / N) = sqrt((30 - 4 * 6.25) / 4) = sqrt(1.25) = 1.11803399
Using the original (definitional) formula the answer is also 1.11803399.
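As a quick check (a hypothetical, standalone snippet; the class and method names are made up), both forms of the formula can be computed on the sample input to confirm they agree:

```java
// Verify that the expanded one-pass formula matches the definitional
// two-pass formula on the sample input {1, 2, 3, 4}.
public class FormulaCheck {
    // One-pass form: sqrt((A - N * avg^2) / N), with A = sum(Xi^2).
    static double onePass(double[] xs) {
        int n = xs.length;
        double a = 0, b = 0;
        for (double x : xs) { a += x * x; b += x; }
        double avg = b / n;
        return Math.sqrt((a - n * avg * avg) / n);
    }

    // Definitional form: sqrt(sum((Xi - Xa)^2) / N).
    static double twoPass(double[] xs) {
        int n = xs.length;
        double b = 0;
        for (double x : xs) b += x;
        double avg = b / n;
        double s = 0;
        for (double x : xs) s += (x - avg) * (x - avg);
        return Math.sqrt(s / n);
    }

    public static void main(String[] args) {
        double[] xs = { 1, 2, 3, 4 };
        System.out.println(onePass(xs) + " == " + twoPass(xs));
    }
}
```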
Amar
On Mon, Nov 10, 2008 at 4:22 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
Amar Kamat wrote:
some speed wrote:
I was wondering if it was possible to read the input for a map function
from 2 different files:
1st file ---> user-input file from a particular location (path)
Is the input/user file sorted? If yes, then you can use "map-side join" for
performance reasons. See org.apache.hadoop.mapred.join for more details.
2nd file ---> a resultant file (has just one <key,value> pair) from a
previous MapReduce job. (I am implementing a chain MapReduce function)
Can you explain in more detail the contents of 2nd file?
Now, for every <key,value> pair in the user-input file, I would like to
use the same <key,value> pair from the 2nd file for some calculations.
Can you explain this in more detail? Can you give some abstracted example
of what file1 and file2 look like and what operation/processing you want
to do?
I guess you might need to do some kind of join on the 2 files. Look at
contrib/data_join for more details.
Amar
Is it possible for me to do so? Can someone guide me in the right
direction please?
Thanks!