Sorry, I forgot to mention that I was also running ClusterEvaluator
along with CDbwEvaluator yesterday. I was getting inter < intra with the
former, and the reverse with the latter, and even allowing for the fact
that they're different algorithms I was a bit suspicious of this.
However, I'd also turned off KMeans to speed up the process while
debugging, and was running CDbw directly on the generated Canopies. I've
run it today with Canopy + KMeans and 10 representative points per
cluster, and the results look better than last night, e.g.
CDbw = 178.90082114134046
Intra-cluster density = 0.26970259013839315
Inter-cluster density = 0.2517752232288567
Separation = 663.326299719779
I reran with Canopy only (to double-check) and this generated the
following (I assume overlapping canopies are causing inter to be greater
than intra):
CDbw = 97.38825072133858
Intra-cluster density = 0.23594531216994152
Inter-cluster density = 0.9859430197400473
Separation = 412.7577268888219
To confirm: it's running much better now than before, as the successful
KMeans evaluation above includes the cluster with the identical points
which are marginally different from the centre. Thanks for all the help
on this.
@Jeff: is ClusterEvaluator useful too, or should CDbwEvaluator be
considered more accurate?
On 29/09/10 20:51, Jeff Eastman wrote:
The CDbw intra-cluster density calculation uses the average std of
all the clusters in its normalization and produces an average of the
per-cluster intra-cluster densities so it's likely the other, looser
clusters are degrading that result.
CDbw is not a panacea either. The Dirichlet unit test, for example,
produces a very low metric despite clustering over the same data set
that the DisplayDirichlet uses. That clustering finds almost exactly
the parameters of the input data set (try it), but the clusters
overlap and that messes with CDbw density calculations which reward
mostly non-overlapping clusters.
I'm happy that the Online accumulator works better. I will plug it in
in my next commit and tweak the unit tests to accept its values.
On 9/29/10 2:59 PM, Ted Dunning wrote:
That slight difference case is exactly where the running sums approach
fails. I think that you found the problem.
Your suspicion about inter being larger than intra confuses me. Isn't
that just another way of saying that clusters are very tight?
On Wed, Sep 29, 2010 at 11:19 AM, Derek O'Callaghan <[email protected]> wrote:
I just stepped through invalidCluster(), and it seems that there's a
slight difference between the centre and the other points, so it returns
false. I was positive that there was no difference when I stepped
through it last, I must have overlooked something, sorry about that.
I just tried the OnlineGaussianAccumulator and it does run better, in
that I get values for the 4 metrics. One thing I need to check is why
the inter-density is so much bigger than the intra-, I'm getting the
following values:
CDbw = 68.51761788802385
Intra-cluster density = 0.3734803797950363
Inter-cluster density = 3.474415557178534
Separation = 183.4570745741071
When using RunningSums and ignoring the identical points cluster, I get
a similar issue in that inter = ~1.5, with intra = ~0.15. I have to
leave for the evening, I'll look into it tomorrow to see if I can
determine if it's correct.
Thanks again.
On 29/09/10 18:55, Jeff Eastman wrote:
If all of the representative points for that cluster are identical then
they are also identical to the cluster center (the first representative
point) and should be pruned. I'm wondering why this was not detected in
invalidCluster, can you investigate that? You may also want to plug in
an instance of the new OnlineGaussianAccumulator to see if it does any
better. It is likely to be much more stable than the RunningSums...
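For readers following along, here is a minimal sketch of why an online accumulator tends to be more stable than raw running sums. It uses Welford's online algorithm on scalar values. Whether OnlineGaussianAccumulator uses exactly this update is an assumption, and the class name WelfordAccumulator is hypothetical, not part of Mahout.

```java
// Hypothetical scalar accumulator (not Mahout's OnlineGaussianAccumulator).
// It tracks the running mean and the sum of squared deviations from that
// mean, so it never subtracts two large, nearly equal sums.
class WelfordAccumulator {
    private long n;
    private double mean;
    private double m2; // sum of squared deviations from the running mean

    void observe(double x) {
        n++;
        double delta = x - mean;   // deviation from the old mean
        mean += delta / n;         // update the mean
        m2 += delta * (x - mean);  // old deviation times new deviation
    }

    double mean() {
        return mean;
    }

    // Population std (divide by n), matching the divide-by-s0 convention.
    double std() {
        return n == 0 ? 0.0 : Math.sqrt(m2 / n);
    }

    public static void main(String[] args) {
        WelfordAccumulator acc = new WelfordAccumulator();
        for (double x : new double[] {0.1, 0.1, 0.1}) {
            acc.observe(x);
        }
        // Identical points: every deviation is exactly zero.
        System.out.println(acc.std()); // prints 0.0
    }
}
```

For identical points every delta after the first observation is exactly zero, so the std comes out as 0.0 rather than NaN.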
On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
Thanks for that Jeff. I tried the changes and get the same result as
expected. FYI I've investigated further and it seems that all of the
points in the affected cluster are identical, so it ends up as more or
less the same problem we had last week with clusters with total points <
# representative points, in that there are duplicate representative
points. In this case total > # representative, but the end result is the
same.
I'm wondering if the quickest and easiest solution is to simply ignore
such clusters, i.e. those that currently generate a NaN std? I'm not
sure if it's the "correct" approach though...
On 29/09/10 17:37, Jeff Eastman wrote:
Hi Derek,
I've committed some changes which will hopefully help in fixing this
problem but which do not yet accomplish that. As you can see from the
new CDbw test (testAlmostSameValueCluster) I tried creating a test
cluster with points identical to the cluster center but with one which
differed from it by Double.MIN_NORMAL in one element. That test failed
to duplicate your issue.
The patch also factors out the std calculation into an implementor of
GaussianAccumulator. I factored the current std calculations out of
CDbwEvaluator into RunningSumsGaussianAccumulator and all the tests
produced the same results as before. With the new
OnlineGaussianAccumulator plugged in, the tests all return slightly
different results but still no NaNs.
I still have not added priors and I'm not entirely sure where to do
that. I've committed the changes so you can see my quandary.
OnlineGaussianAccumulator is still a work in progress but, since it is
never used, it is in the commit for your viewing.
Jeff
On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
Thanks Jeff, I'll try out the changes when they're committed. I tried a
couple of things locally (removing the clusters/setting a small prior),
but I ended up with inter-density > intra-density, so I suspect I've
slipped up somewhere. I'll hold off on it for now.
On 29/09/10 13:48, Jeff Eastman wrote:
Hi Derek,
That makes sense. With the very, very tight cluster that your clustering
produced you've uncovered an instability in that std calculation. I'm
going to rework that method today to use a better algorithm and will add
a small prior in the process. I'm also going to add a unit test to
reproduce this problem first. Look for a commit in a couple of hours.
On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
Hi Jeff,
FYI I checked the problem I was having in CDbwEvaluator with the same
dataset from the ClusterEvaluator thread. The problem is occurring in
the std calculation in CDbwEvaluator.computeStd(), in that
s2.times(s0).minus(s1.times(s1)) generates negative values, which then
produce NaN with the subsequent SquareRootFunction(). This then sets the
average std to NaN later on in intraClusterDensity(). It's happening for
the cluster I have with the almost-identical points.
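The failure described above can be sketched on plain scalars. This is an illustration only: it uses double values instead of Mahout Vectors, the class name RunningSumsStdDemo is hypothetical, and the second set of sums is hand-picked to simulate the rounding error that near-identical points can leave in s2.

```java
// Hypothetical scalar demo of the running-sums std formula (not Mahout code).
class RunningSumsStdDemo {

    // Scalar form of s2.times(s0).minus(s1.times(s1)), then sqrt, then
    // divide(s0). Math.sqrt returns NaN for a negative argument.
    static double runningSumsStd(double s0, double s1, double s2) {
        return Math.sqrt(s2 * s0 - s1 * s1) / s0;
    }

    // Accumulates s0 (count), s1 (sum) and s2 (sum of squares) for a sample.
    static double runningSumsStd(double[] xs) {
        double s0 = 0.0;
        double s1 = 0.0;
        double s2 = 0.0;
        for (double x : xs) {
            s0 += 1.0;
            s1 += x;
            s2 += x * x;
        }
        return runningSumsStd(s0, s1, s2);
    }

    public static void main(String[] args) {
        // Well-separated values: s0 = 2, s1 = 4, s2 = 10.
        System.out.println(runningSumsStd(new double[] {1.0, 3.0})); // prints 1.0

        // Hand-picked sums for two points at 1.0 where s2 has lost one unit
        // in the last place: s2 * s0 - s1 * s1 is slightly below zero.
        System.out.println(runningSumsStd(2.0, 2.0, 1.9999999999999998)); // prints NaN
    }
}
```

The true variance in the second case is zero, so an error of a single bit in s2 is enough to push the argument of the square root negative.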
It's the same symptom as the problem last week, where this was happening
when s0 was 1. Is the solution to ignore these clusters, like the s0 = 1
clusters? Or to add a small prior std as was done for the similar issue
in NormalModel.pdf()?
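The two options in the question above might look roughly like this on scalars. Everything here is a sketch under assumptions: the class GuardedStd, its method names and the PRIOR_STD value are hypothetical, and this is not how NormalModel.pdf() or CDbwEvaluator actually handle it.

```java
// Hypothetical helpers contrasting "skip the cluster" with "add a prior".
class GuardedStd {

    // Illustrative prior only; a real value would need tuning to the data.
    static final double PRIOR_STD = 1e-9;

    // Option 1: report whether the cluster yields a usable std at all,
    // so that the caller can ignore it otherwise.
    static boolean hasUsableStd(double s0, double s1, double s2) {
        return s0 > 1.0 && s2 * s0 - s1 * s1 > 0.0;
    }

    // Option 2: clamp a negative scaled variance to zero, then add a small
    // prior so that the result is never NaN and never exactly zero.
    static double stdWithPrior(double s0, double s1, double s2) {
        double scaledVariance = Math.max(0.0, s2 * s0 - s1 * s1);
        return Math.sqrt(scaledVariance) / s0 + PRIOR_STD;
    }

    public static void main(String[] args) {
        // Sums simulating a tight cluster whose s2 has a rounding error.
        System.out.println(hasUsableStd(2.0, 2.0, 1.9999999999999998)); // prints false
        System.out.println(stdWithPrior(2.0, 2.0, 1.9999999999999998)); // prints 1.0E-9
    }
}
```

Skipping changes which clusters contribute to the average, while the prior keeps every cluster but biases very tight ones slightly upward, so the two choices can give different CDbw values.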
Thanks,
Derek
On 28/09/10 20:28, Jeff Eastman wrote:
Hi Ted,
The clustering code computes this value for cluster radius. Currently,
it is done with a running sums approach (s^0, s^1, s^2) that computes
the std of each vector term using:
Vector std = s2.times(s0).minus(s1.times(s1)).assign(new SquareRootFunction()).divide(s0);
For CDbw, they need a scalar, average std value, and this is
currently computed by averaging the vector terms:
double d = std.zSum() / std.size();
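As a worked illustration of the two steps above, here is a plain-array sketch that avoids the Mahout Vector API. The class AverageStd and its method names are hypothetical; s1 and s2 hold the per-component sums and sums of squares.

```java
// Hypothetical plain-array version of the per-term std and its average.
class AverageStd {

    // Per-component std from the running sums, mirroring
    // s2.times(s0).minus(s1.times(s1)) -> sqrt -> divide(s0).
    static double[] componentStd(double s0, double[] s1, double[] s2) {
        double[] std = new double[s1.length];
        for (int i = 0; i < s1.length; i++) {
            std[i] = Math.sqrt(s2[i] * s0 - s1[i] * s1[i]) / s0;
        }
        return std;
    }

    // Scalar average of the components, mirroring std.zSum() / std.size().
    static double average(double[] std) {
        double sum = 0.0;
        for (double v : std) {
            sum += v;
        }
        return sum / std.length;
    }

    public static void main(String[] args) {
        // Two points, (1, 10) and (3, 10): s0 = 2, s1 = (4, 20), s2 = (10, 200).
        double[] std = componentStd(2.0, new double[] {4.0, 20.0}, new double[] {10.0, 200.0});
        // Component stds are 1.0 and 0.0, so the average is 0.5.
        System.out.println(average(std)); // prints 0.5
    }
}
```

Note that the average treats every component equally, so a single NaN component makes the whole scalar NaN.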
The more I read about it, however, the less confident I am about this
approach. The paper itself seems to indicate a covariance approach, but
I am lost in their notation. See page 5, just above Definition 1.
www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf