On Tue, Sep 28, 2010 at 12:28 PM, Jeff Eastman
<[email protected]>wrote:
> Hi Ted,
>
> The clustering code computes this value for cluster radius. Currently, it
> is done with a running sums approach (s^0, s^1, s^2) that computes the std
> of each vector term using:
>
> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new
> SquareRootFunction()).divide(s0);
>
This is algebraically correct, but can be subject to bad round-off error
(catastrophic cancellation when the mean is large relative to the spread).
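As a hypothetical illustration (standalone code, made-up values, not the Mahout classes): three distances near 1e9 with true variance 14 make the running-sums formula go negative in doubles:

```java
// Demonstrates catastrophic cancellation in the s^0/s^1/s^2 formula.
// Hypothetical standalone example; the values are made up.
public class SumsRoundoff {
    public static void main(String[] args) {
        double[] r = {1e9 + 4, 1e9 + 7, 1e9 + 13};   // true variance is 14
        double s0 = 0, s1 = 0, s2 = 0;
        for (double x : r) {
            s0 += 1;
            s1 += x;
            s2 += x * x;
        }
        // In exact arithmetic s0*s2 - s1*s1 = n^2 * variance = 126,
        // but in doubles the two ~9e18 terms cancel badly:
        double numerator = s0 * s2 - s1 * s1;
        System.out.println(numerator);  // negative, so the sqrt would be NaN
    }
}
```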
I would recommend Welford's method, which keeps a running count (s^0), the
current mean, and the current variance. Then to update the count, mean, and
variance of cluster i given a new distance r:
    double n = counts.getQuick(i) + 1;
    counts.setQuick(i, n);
    double oldMean = means.getQuick(i);
    double newMean = oldMean + (r - oldMean) / n;
    means.setQuick(i, newMean);
    double var = variance.getQuick(i);
    variance.setQuick(i, var + ((r - oldMean) * (r - newMean) - var) / n);
The variance is the square of the std, of course. If you initialize the
variance with a small number and the count with something non-zero, you have
effectively put a prior on the std (and on the mean, unless you track the
non-zero initial count separately). If the initial count is < 1, the prior
carries very little weight or effect, except in the nasty cases.
See http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
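A self-contained sketch of that update, with plain arrays standing in for the Mahout vectors (class and method names are hypothetical):

```java
// Welford-style online update of count, mean, and variance per cluster.
// Hypothetical sketch; plain arrays replace counts/means/variance vectors.
public class OnlineStats {
    final double[] counts, means, variance;

    OnlineStats(int numClusters) {
        counts = new double[numClusters];
        means = new double[numClusters];
        variance = new double[numClusters];
    }

    /** Fold a new distance r into the statistics for cluster i. */
    void update(int i, double r) {
        double n = counts[i] + 1;
        counts[i] = n;
        double oldMean = means[i];
        double newMean = oldMean + (r - oldMean) / n;
        means[i] = newMean;
        variance[i] += ((r - oldMean) * (r - newMean) - variance[i]) / n;
    }

    public static void main(String[] args) {
        OnlineStats stats = new OnlineStats(1);
        // Same large-mean values that break the running-sums formula:
        for (double r : new double[]{1e9 + 4, 1e9 + 7, 1e9 + 13}) {
            stats.update(0, r);
        }
        System.out.println(stats.variance[0]);  // 14.0, no cancellation
    }
}
```

Seeding `variance[i]` with a small value and `counts[i]` with a fraction like 0.5 gives the weak prior described above.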
> For CDbw, they need a scalar, average std value, and this is currently
> computed by averaging the vector terms:
>
> double d = std.zSum() / std.size();
>
> The more I read about it, however, the less confident I am about this
> approach.
That makes reasonable sense.
> The paper itself seems to indicate a covariance approach, but I am lost in
> their notation. See page 5, just above Definition 1.
>
>
> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
I will take a quick glance.