If all of the representative points for that cluster are identical
then they are also identical to the cluster center (the first
representative point) and should be pruned. I'm wondering why this was
not detected in invalidCluster; can you investigate that? You may also
want to plug in an instance of the new OnlineGaussianAccumulator to see
if it does any better. It is likely to be much more stable than the
RunningSums...
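For reference, here's a rough sketch of the kind of prune check I mean
(the map layout and method are illustrative only, not committed code,
and the exact-equality test via getDistanceSquared() could be loosened
to a small tolerance if needed):

    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;
    import org.apache.mahout.math.Vector;

    public class RepresentativePointPruner {
      // Drop any cluster whose representative points are all identical
      // to the cluster center (the first representative point).
      public static void prune(Map<Integer, List<Vector>> repPoints) {
        Iterator<List<Vector>> it = repPoints.values().iterator();
        while (it.hasNext()) {
          List<Vector> reps = it.next();
          Vector center = reps.get(0);
          boolean allIdentical = true;
          for (Vector rep : reps) {
            if (rep.getDistanceSquared(center) > 0) {
              allIdentical = false;
              break;
            }
          }
          if (allIdentical) {
            it.remove();  // degenerate cluster: prune it
          }
        }
      }
    }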
On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
Thanks for that, Jeff. I tried the changes and got the same result, as
expected. FYI I've investigated further and it seems that all of the
points in the affected cluster are identical, so it ends up being more
or less the same problem we had last week with clusters whose total
points < # representative points, in that there are duplicate
representative points. In this case total > # representative points,
but the end result is the same.
I'm wondering if the quickest and easiest solution is to simply ignore
such clusters, i.e. those that currently generate a NaN std? I'm not
sure if it's the "correct" approach though...
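For what it's worth, all I have in mind is a guard along these lines
(isUsable() is just an illustrative name, not existing CDbwEvaluator
code):

    // Skip any cluster whose std came out NaN so it can't poison the
    // intra-cluster density average.
    static boolean isUsable(double clusterStd) {
      return !Double.isNaN(clusterStd);
    }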
On 29/09/10 17:37, Jeff Eastman wrote:
Hi Derek,
I've committed some changes which will hopefully help in fixing this
problem but which do not yet accomplish that. As you can see from the
new CDbw test (testAlmostSameValueCluster), I tried creating a test
cluster with points identical to the cluster center but with one that
differed from it by Double.MIN_NORMAL in one element. That test
failed to duplicate your issue.
The patch also factors out the std calculation into an implementor of
GaussianAccumulator. I factored the current std calculations out of
CDbwEvaluator into RunningSumsGaussianAccumulator and all the tests
produced the same results as before. With the new
OnlineGaussianAccumulator plugged in, the tests all return slightly
different results but still no NaNs.
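Roughly, the accumulator contract amounts to something like this
(signatures here are a sketch of the idea, not necessarily the
committed interface):

    import org.apache.mahout.math.Vector;

    // Sketch of the accumulator contract; names/signatures may differ
    // from the committed code.
    public interface GaussianAccumulator {
      void observe(Vector x, double weight); // fold in one observation
      void compute();                        // finalize the statistics
      Vector getMean();
      Vector getStd();                       // per-term std
      double getAverageStd();                // the scalar std CDbw needs
    }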
I still have not added priors and I'm not entirely sure where to do
that. I've committed the changes so you can see my quandary.
OnlineGaussianAccumulator is still a work in progress but, since it
is never used, it is in the commit just for your viewing.
Jeff
On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
Thanks Jeff, I'll try out the changes when they're committed. I
tried a couple of things locally (removing the clusters/setting a
small prior), but I ended up with inter-density > intra-density, so
I suspect I've slipped up somewhere. I'll hold off on it for now.
On 29/09/10 13:48, Jeff Eastman wrote:
Hi Derek,
That makes sense. With the very, very tight cluster that your
clustering produced, you've uncovered an instability in that std
calculation. I'm going to rework that method today to use a better
algorithm and will add a small prior in the process. I'm also going
to add a unit test to reproduce this problem first. Look for a
commit in a couple of hours.
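One standard candidate for that better algorithm is Welford's online
update, which never forms the big cancelling products; here is a
scalar sketch of the idea (not the actual accumulator code):

    // Welford's online mean/variance update, scalar form for clarity.
    public class OnlineStd {
      private double n;
      private double mean;
      private double m2; // running sum of squared deviations from mean

      public void observe(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // updated mean; stays >= 0 up to rounding
      }

      public double std(double prior) {
        // m2 / n is the population variance; a small prior keeps the
        // result strictly positive for a cluster of identical points.
        return n == 0 ? Math.sqrt(prior) : Math.sqrt(m2 / n + prior);
      }
    }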
On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
Hi Jeff,
FYI I checked the problem I was having in CDbwEvaluator with the
same dataset from the ClusterEvaluator thread; the problem is
occurring in the std calculation in CDbwEvaluator.computeStd(), in
that s2.times(s0).minus(s1.times(s1)) generates negative values,
which then produce NaN with the subsequent SquareRootFunction().
This then sets the average std to NaN later on in
intraClusterDensity(). It's happening for the cluster I have with
the almost-identical points.
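The effect is easy to reproduce outside Mahout with plain doubles;
this toy program (mine, not from the codebase) just prints whatever
sign the rounding happens to produce:

    public class CancellationDemo {
      public static void main(String[] args) {
        double x = 1.0 / 3.0; // any value with no exact binary form
        double s0 = 0, s1 = 0, s2 = 0;
        for (int i = 0; i < 1000; i++) { // a cluster of identical points
          s0 += 1;
          s1 += x;
          s2 += x * x;
        }
        double radicand = s2 * s0 - s1 * s1; // mathematically zero here
        System.out.println("radicand = " + radicand);
        // If rounding pushes the radicand below zero, the std is NaN:
        System.out.println("std = " + (Math.sqrt(radicand) / s0));
      }
    }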
It's the same symptom as the problem last week, where this was
happening when s0 was 1. Is the solution to ignore these clusters,
like the s0 = 1 clusters? Or to add a small prior std as was done
for the similar issue in NormalModel.pdf()?
Thanks,
Derek
On 28/09/10 20:28, Jeff Eastman wrote:
Hi Ted,
The clustering code computes this value for cluster radius.
Currently, it is done with a running sums approach (s^0, s^1,
s^2) that computes the std of each vector term using:
Vector std = s2.times(s0).minus(s1.times(s1)).assign(new
SquareRootFunction()).divide(s0);
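Per vector term j this is just the textbook sum-of-squares identity
(restating it here for clarity):

    std_j = sqrt(s0*s2_j - s1_j^2) / s0
          = sqrt( s2_j/s0 - (s1_j/s0)^2 )
          = sqrt( mean(x_j^2) - mean(x_j)^2 )

which is the form known to be prone to cancellation when the values
are nearly identical.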
For CDbw, they need a scalar average std value, and this is
currently computed by averaging the vector terms:
double d = std.zSum() / std.size();
The more I read about it, however, the less confident I am about
this approach. The paper itself seems to indicate a covariance
approach, but I am lost in their notation. See page 5, just above
Definition 1.
www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf