Hi Derek,
That makes sense. With the very, very tight cluster that your
clustering produced, you've uncovered an instability in that std
calculation. I'm going to rework that method today to use a more
numerically stable algorithm and will add a small prior in the
process. I'll also add a unit test that reproduces this problem
first. Look for a commit in a couple of hours.
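Roughly the kind of change I have in mind, as a sketch only (the names
below are placeholders, not the actual CDbwEvaluator method, and the
prior value is still to be decided): compute the mean first, then sum
squared deviations from it, and floor the result at a small prior.

  // Sketch only: names are placeholders, not the actual CDbwEvaluator
  // method, and MINIMAL_STD stands in for whatever prior we settle on.
  public final class StableStdSketch {

    private static final double MINIMAL_STD = 1.0e-6; // placeholder prior

    // Two-pass per-term std: compute the mean first, then sum squared
    // deviations from it, which avoids the cancellation in
    // s2*s0 - s1*s1 when the points are nearly identical.
    static double computeStd(double[] values) {
      int n = values.length;
      double mean = 0.0;
      for (double v : values) {
        mean += v;
      }
      mean /= n;
      double sumSq = 0.0;
      for (double v : values) {
        sumSq += (v - mean) * (v - mean);
      }
      // Floor at the small prior so a duplicate-point or single-point
      // cluster never yields a zero (or NaN) std.
      return Math.max(Math.sqrt(sumSq / n), MINIMAL_STD);
    }

    public static void main(String[] args) {
      double[] tight = {10000.0001, 10000.0001, 10000.0002};
      System.out.println(computeStd(tight));
    }
  }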
On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
Hi Jeff,
FYI I checked the problem I was having in CDbwEvaluator with the
same dataset from the ClusterEvaluator thread; the problem is
occurring in the std calculation in CDbwEvaluator.computeStd(), where
s2.times(s0).minus(s1.times(s1)) generates negative values, which
then produce NaN in the subsequent SquareRootFunction(). This then
sets the average std to NaN later on in intraClusterDensity(). It's
happening for the cluster I have with the almost-identical points.
It's the same symptom as the problem last week, where this was
happening when s0 was 1. Is the solution to ignore these clusters,
like the s0 = 1 clusters? Or to add a small prior std as was done for
the similar issue in NormalModel.pdf()?
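For what it's worth, here's a tiny plain-Java illustration of the
cancellation (made-up numbers, not the actual CDbwEvaluator code):
with near-duplicate points the s2*s0 - s1*s1 value is dominated by
rounding error, so it can come out slightly negative, and sqrt() of a
negative double is NaN.

  // Plain-Java illustration with invented data, not Mahout code.
  public final class RadicandDemo {
    public static void main(String[] args) {
      double[] points = {10000.0001, 10000.0001, 10000.0002};

      double s0 = 0.0; // number of points
      double s1 = 0.0; // running sum of x
      double s2 = 0.0; // running sum of x * x
      for (double x : points) {
        s0++;
        s1 += x;
        s2 += x * x;
      }

      // Sums-of-squares form: the true value here is around 2e-8,
      // well below the rounding error of the ~9e8-sized products, so
      // the sign of the result is essentially noise and can go
      // negative, making the sqrt NaN.
      double radicand = s2 * s0 - s1 * s1;
      System.out.println("radicand = " + radicand);
      System.out.println("std = " + Math.sqrt(radicand) / s0);

      // Two-pass form for comparison: the radicand is a sum of
      // squares, so it can never be negative.
      double mean = s1 / s0;
      double sumSq = 0.0;
      for (double x : points) {
        sumSq += (x - mean) * (x - mean);
      }
      System.out.println("two-pass std = " + Math.sqrt(sumSq / s0));
    }
  }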
Thanks,
Derek
On 28/09/10 20:28, Jeff Eastman wrote:
Hi Ted,
The clustering code computes this value for cluster radius.
Currently, it is done with a running sums approach (s^0, s^1, s^2)
that computes the std of each vector term using:
Vector std = s2.times(s0).minus(s1.times(s1)).assign(new
SquareRootFunction()).divide(s0);
For CDbw, a single scalar average std value is needed, and this is
currently computed by averaging the vector terms:
double d = std.zSum() / std.size();
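In plain Java, just to spell out what that Vector expression is doing
(this assumes s0 is the point count and s1, s2 are the per-term
running sums of x and x^2; it's a paraphrase, not the actual Mahout
code):

  // Plain-Java paraphrase of the Vector expression above, assuming s0
  // is the point count and s1[i], s2[i] are the running sums of x_i
  // and x_i*x_i for each term. Not the actual Mahout code, just the
  // arithmetic it performs.
  public final class ScalarStdSketch {

    static double averageStd(double s0, double[] s1, double[] s2) {
      double[] std = new double[s1.length];
      for (int i = 0; i < std.length; i++) {
        // per-term std: sqrt(s0 * sum(x^2) - (sum x)^2) / s0
        std[i] = Math.sqrt(s2[i] * s0 - s1[i] * s1[i]) / s0;
      }
      // scalar value for CDbw: the mean of the per-term stds,
      // i.e. std.zSum() / std.size() in the Vector code
      double sum = 0.0;
      for (double v : std) {
        sum += v;
      }
      return sum / std.length;
    }

    public static void main(String[] args) {
      // three 2-d points: (1, 10), (2, 20), (3, 30)
      double s0 = 3.0;
      double[] s1 = {6.0, 60.0};
      double[] s2 = {14.0, 1400.0};
      System.out.println(averageStd(s0, s1, s2));
    }
  }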
The more I read about it, however, the less confident I am about this
approach. The paper itself seems to indicate a covariance approach,
but I am lost in their notation. See page 5, just above Definition 1.
www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf