On Sat, Apr 4, 2009 at 2:11 PM, Christian Ulrik Søttrup <[email protected]>wrote:
> Hello all,
>
> I need to do some calculations that have to merge two sets of very large
> data (basically calculate variance).
> One set contains a set of "means" and the second a set of objects tied to
> a mean.
>
> Normally I would send the set of means using the distributed cache, but
> the set has become too large to keep in memory and it is going to grow in
> the future.

Hi Christian,

Others have done a good job answering your question about doing this as a join, but here's one idea that might allow you to skip the join altogether:

If you're simply calculating the variance of data sets, you can use a bit of a math trick to do it in one pass without precomputing the means:

E = the expectation operator
mu = mean = E[x]

Variance = E[ (x - mu)^2 ]

Expand the square:

         = E[x^2 - 2*x*mu + mu^2]

By linearity of expectation:

         = E[x^2] - 2*mu*E[x] + E[mu^2]

mu in this equation is constant, so E[mu^2] = mu^2. Also recall that E[x] = mu, so:

         = E[x^2] - 2*E[x]^2 + E[x]^2
         = E[x^2] - E[x]^2

Apologies for the ugly math notation, but hopefully it's clear. The takeaway is that you can separately calculate sum(x^2) and sum(x) in your job, and calculate variance directly from the results. Here's the general outline for the MR job:

Map: collect (1, x, x^2) for each value x
Combine: sum up the tuples component-wise
Reduce:
  input from combine: (N, sum(x), sum(x^2))
  output: Variance = (1/N)*sum(x^2) - ((1/N)*sum(x))^2

Hope that's helpful for you!

-Todd
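For what it's worth, here's a minimal sketch of that outline in plain Python (not actual Hadoop code; the function names map_phase/combine/reduce_phase are just illustrative stand-ins for the map, combine, and reduce steps):

```python
def map_phase(values):
    # Map: for each value x, collect the tuple (1, x, x^2)
    return [(1, x, x * x) for x in values]

def combine(tuples):
    # Combine (and reduce input): sum tuples component-wise
    # -> (N, sum(x), sum(x^2)). Summing is associative, so this
    # works the same whether run once or over partial combiner outputs.
    n = sum(t[0] for t in tuples)
    sx = sum(t[1] for t in tuples)
    sxx = sum(t[2] for t in tuples)
    return (n, sx, sxx)

def reduce_phase(n, sx, sxx):
    # Variance = E[x^2] - E[x]^2 = (1/N)*sum(x^2) - ((1/N)*sum(x))^2
    mean = sx / n
    return sxx / n - mean * mean

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n, sx, sxx = combine(map_phase(data))
print(reduce_phase(n, sx, sxx))  # population variance -> 4.0
```

One caveat: this E[x^2] - E[x]^2 form can lose floating-point precision when the values are large relative to their spread, so for such data it's worth keeping that in mind.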
