Hi Todd,

Thanks, I already know that trick. I probably shouldn't have said "means". It is actually a combined high-dimensional matrix representing the "mean" of a set of other matrices. I then need to find the mean difference and variance between the constituent matrices and the combined matrix. Here I can use your trick.

cheers,
Christian

Todd Lipcon wrote:
On Sat, Apr 4, 2009 at 2:11 PM, Christian Ulrik Søttrup <[email protected]> wrote:

Hello all,

I need to do some calculations that have to merge two sets of very large
data (basically calculate variance).
One set contains a set of "means" and the second a set of objects tied to
a mean.

Normally I would send the set of means using the distributed cache, but
the set has become too large to keep in memory and it is going to grow in
the future.


Hi Christian,

Others have done a good job answering your question about doing this as a
join, but here's one idea that might allow you to skip the join altogether:

If you're simply calculating variance of data sets, you can use a bit of a
math trick to do it in one pass without precomputing the means:

E = the expectation operator
mu = mean = E[x]
Variance = E[ (x - mu)^2 ]

Expand the square:
= E[x^2 - 2*x*mu + mu^2]

by linearity of expectation:
= E[x^2] - 2*mu*E[x] + E[mu^2]

mu in this equation is constant, so E[mu^2] = mu^2.
Also recall that E[x] = mu
= E[x^2] - 2*E[x]^2 + E[x]^2
= E[x^2] - E[x]^2
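
The identity is easy to sanity-check numerically. Here is a small sketch (not part of the original mail) that computes the variance of a sample both ways, using the population (1/N) convention from the derivation:

```python
# Verify Var(x) = E[(x - mu)^2] equals E[x^2] - E[x]^2 on a small sample.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(xs)

mean = sum(xs) / n
var_direct = sum((x - mean) ** 2 for x in xs) / n      # E[(x - mu)^2]
var_one_pass = sum(x * x for x in xs) / n - mean ** 2  # E[x^2] - E[x]^2

print(var_direct, var_one_pass)  # both print 4.0 for this sample
```

The second form never needs the mean up front, which is exactly why it maps cleanly onto a single MR pass.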

Apologies for the ugly math notation, but hopefully it's clear. The takeaway
is that you can separately calculate sum(x^2) and sum(x) in your job, and
calculate variance directly from the results. Here's the general outline for
the MR job:

Map:
  collect (1, x, x^2)
Combine:
  sum up tuples
Reduce:
  input from combine: (N, sum(x), sum(x^2))
  output: Variance = (1/N)*sum(x^2) - ((1/N)*sum(x))^2
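
In case the outline is easier to read as code, here is a minimal sketch of the three stages as plain Python functions (illustrative only; the function names are made up, and a real job would of course implement Mapper/Combiner/Reducer against the Hadoop API):

```python
def map_record(x):
    # Map: emit a (count, x, x^2) tuple under a single shared key.
    return ("stats", (1, x, x * x))

def combine(tuples):
    # Combine: component-wise sum of (count, sum(x), sum(x^2)) tuples.
    n = sum(t[0] for t in tuples)
    s = sum(t[1] for t in tuples)
    ss = sum(t[2] for t in tuples)
    return (n, s, ss)

def reduce_stats(n, s, ss):
    # Reduce: Variance = (1/N)*sum(x^2) - ((1/N)*sum(x))^2
    return ss / n - (s / n) ** 2

# Tiny end-to-end run:
data = [2.0, 4.0, 6.0]
tuples = [map_record(x)[1] for x in data]
n, s, ss = combine(tuples)
print(reduce_stats(n, s, ss))  # variance of [2, 4, 6] is 8/3
```

Because the combine step is just component-wise addition, it is associative and commutative, which is what lets Hadoop run it safely on partial map output.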

Hope that's helpful for you!

-Todd

