On Tue, Sep 28, 2010 at 12:28 PM, Jeff Eastman
<[email protected]>wrote:

>  Hi Ted,
>
> The clustering code computes this value for cluster radius. Currently, it
> is done with a running sums approach (s^0, s^1, s^2) that computes the std
> of each vector term using:
>
> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new
> SquareRootFunction()).divide(s0);
>

This is algebraically correct, but it can be subject to bad round-off error.
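For instance, here is a quick standalone sketch (plain doubles, not the Mahout code; the class and method names are made up) showing how the sums formula cancels catastrophically when the values sit far from zero, while a two-pass computation stays accurate:

```java
// Demo: the (s2*s0 - s1*s1)/s0^2 variance formula loses all precision
// when the mean is large relative to the spread of the data.
public class RoundoffDemo {

    // 1000 values alternating between 1e8 and 1e8 + 1;
    // the true population variance is exactly 0.25
    static double[] sampleData() {
        double[] xs = new double[1000];
        for (int i = 0; i < xs.length; i++) {
            xs[i] = 1e8 + (i % 2);
        }
        return xs;
    }

    // running-sums approach: variance = (s2*s0 - s1*s1) / s0^2
    static double sumsVariance(double[] xs) {
        double s0 = 0, s1 = 0, s2 = 0;
        for (double x : xs) {
            s0 += 1;
            s1 += x;
            s2 += x * x;
        }
        // s2*s0 and s1*s1 are both ~1e22 here; their true difference
        // (~2.5e5) is smaller than one ulp, so it cancels away entirely
        return (s2 * s0 - s1 * s1) / (s0 * s0);
    }

    // two-pass reference: compute the mean first, then sum squared
    // deviations; accurate because the deviations are small numbers
    static double twoPassVariance(double[] xs) {
        double mean = 0;
        for (double x : xs) {
            mean += x / xs.length;
        }
        double ss = 0;
        for (double x : xs) {
            ss += (x - mean) * (x - mean);
        }
        return ss / xs.length;
    }

    public static void main(String[] args) {
        double[] xs = sampleData();
        System.out.println("sums:     " + sumsVariance(xs));
        System.out.println("two-pass: " + twoPassVariance(xs));
    }
}
```

On this data the sums version cannot land anywhere near 0.25 (its result is quantized to multiples of one ulp of 1e22 divided by s0^2, about 2.1), while the two-pass version is accurate to several digits.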

I would recommend Welford's method, which keeps a running count (s^0), the
current mean and the current variance.  Then to update counts, means and
variances for cluster i which has a new distance r:

    double n = counts.getQuick(i) + 1;
    counts.setQuick(i, n);

    double oldMean = means.getQuick(i);
    double newMean = oldMean + (r - oldMean) / n;
    means.setQuick(i, newMean);

    double var = variance.getQuick(i);
    // subtract var/n as well, so the stored value stays a normalized
    // variance (M2/n) instead of drifting upward
    variance.setQuick(i, var + ((r - oldMean) * (r - newMean) - var) / n);

The variance is the square of the std, of course.  If you initialize it with
a small number and set counts to something non-zero, you have effectively
added a prior on std (and mean unless you keep the non-zeroness separate).
If the initial count is <1, then the prior has very little weight or
effect, except in the nasty cases.

See
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
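Here is the same update written as a self-contained sketch, with plain fields in place of Mahout's getQuick/setQuick accessors (class and field names are made up; the variance line subtracts var/n so the stored value stays a true variance):

```java
// One-cluster Welford accumulator: count (s^0), running mean,
// running population variance.
public class WelfordSketch {
    double count;     // seed with a small pseudo-count to encode a prior
    double mean;      // running mean
    double variance;  // running population variance (square of the std)

    WelfordSketch(double priorCount, double priorMean, double priorVariance) {
        this.count = priorCount;
        this.mean = priorMean;
        this.variance = priorVariance;
    }

    void add(double r) {
        count += 1;
        double oldMean = mean;
        mean = oldMean + (r - oldMean) / count;
        // incremental form of M2/n: fold the new squared deviation in
        // and decay the old variance by 1/count in one step
        variance += ((r - oldMean) * (r - mean) - variance) / count;
    }

    double std() {
        return Math.sqrt(variance);
    }
}
```

Seeding count with something like 0.5 and variance with a small positive number gives the prior behavior described above without noticeably biasing clusters that see real data.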


> For CDbw, they need a scalar, average std value, and this is currently
> computed by averaging the vector terms:
>
> double d = std.zSum() / std.size();
>
> The more I read about it, however, the less confident I am about this
> approach.


That makes reasonable sense.


> The paper itself seems to indicate a covariance approach, but I am lost in
> their notation. See page 5, just above Definition 1.
>
>
> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
>

I will take a quick glance.
