Re: Standard Deviation of a Set of Vectors

Derek O'Callaghan Thu, 30 Sep 2010 08:38:48 -0700

Thanks for the tip. I had been generating the representative points sequentially but was still using the MR versions of the clustering algorithms; I'll change that now.

Regarding ClusterEvaluator, it seems to rely on RepresentativePointsDriver having been run already, as it loads these in the ClusterEvaluator(Configuration conf, Path clustersIn) constructor? I see another constructor, ClusterEvaluator(Map<Integer, List<VectorWritable>> representativePoints, List<Cluster> clusters, DistanceMeasure measure), where you can specify these points, but this is marked as "test only". Is it okay to use this, passing in the cluster centres, or will it ultimately be removed?

I guess the question is, can ClusterEvaluator.intraClusterDensity() be used, given that it relies on a set of points, and not just the centre, which is all that's required in interClusterDensity()? FYI I had to modify my local copy to ignore my "identical points" cluster as it was generating a NaN density.



On 30/09/10 14:38, Jeff Eastman wrote:

The ClusterEvaluator compares the cluster centers directly to compute the inter-cluster density, whereas the CDbwEvaluator computes it by counting the representative points which are within 2 stds (really stdI + stdJ) from the center of a line segment between the closest representative points of their respective clusters. These are very different approaches and should be expected to yield different values.

The ClusterEvaluator uses the algorithm from "Mahout in Action". I don't know which is "more accurate"; the CDbw approach is certainly more sophisticated, as it is measuring the density of representative points in the inter-cluster region. Pick your poison.

You know, you can run all the clustering algorithms in sequential mode (-xm sequential) to improve performance if the data you are munching is not too huge. The same is true for the representative points calculations. See TestClusterEvaluator.testRepresentativePoints() for an example of this.

On 9/30/10 8:08 AM, Derek O'Callaghan wrote:

Sorry, I forgot to mention that I was also running ClusterEvaluator along with CDbwEvaluator yesterday. I was getting inter < intra with the former, and the reverse with the latter, and even allowing for the fact that they're different algorithms I was a bit suspicious of this.

However, I'd also turned off KMeans to speed up the process while debugging, and was running CDbw directly on the generated Canopies. I've run it today with Canopy + KMeans and 10 representative points per cluster, and the results look better than last night, e.g.
CDbw = 178.90082114134046
Intra-cluster density = 0.26970259013839315
Inter-cluster density = 0.2517752232288567
Separation = 663.326299719779
I reran with Canopy only (to double-check) and this generated the following (I assume overlapping canopies are causing inter to be greater than intra):
CDbw = 97.38825072133858
Intra-cluster density = 0.23594531216994152
Inter-cluster density = 0.9859430197400473
Separation = 412.7577268888219
To confirm: it's running much better now than before, as the successful KMeans evaluation above includes the cluster with the identical points which are marginally different from the centre. Thanks for all the help on this.

@Jeff: is ClusterEvaluator useful too, or should CDbwEvaluator be considered more accurate?
On 29/09/10 20:51, Jeff Eastman wrote:
The CDbw intra-cluster density calculation uses the average std of all the clusters in its normalization and produces an average of the per-cluster intra-cluster densities, so it's likely the other, looser clusters are degrading that result.

CDbw is not a panacea either. The Dirichlet unit test, for example, produces a very low metric despite clustering over the same data set that the DisplayDirichlet uses. That clustering finds almost exactly the parameters of the input data set (try it), but the clusters overlap and that messes with CDbw density calculations, which reward mostly non-overlapping clusters.

I'm happy that the Online accumulator works better. I will plug it in in my next commit and tweak the unit tests to accept its values.
On 9/29/10 2:59 PM, Ted Dunning wrote:
That slight difference case is exactly where the running sums approach fails. I think that you found the problem.

Your suspicion about inter being larger than intra confuses me. Isn't that just another way of saying that clusters are very tight?
On Wed, Sep 29, 2010 at 11:19 AM, Derek O'Callaghan <[email protected]> wrote:
I just stepped through invalidCluster(), and it seems that there's a slight difference between the centre and the other points, so it returns false. I was positive that there was no difference when I stepped through it last; I must have overlooked something, sorry about that.

I just tried the OnlineGaussianAccumulator and it does run better, in that I get values for the 4 metrics. One thing I need to check is why the inter-density is so much bigger than the intra-density; I'm getting the following values:

CDbw = 68.51761788802385
Intra-cluster density = 0.3734803797950363
Inter-cluster density = 3.474415557178534
Separation = 183.4570745741071
When using RunningSums and ignoring the identical points cluster, I get a similar issue in that inter = ~1.5, with intra = ~0.15. I have to leave for the evening; I'll look into it tomorrow to see if I can determine if it's correct.

Thanks again.

On 29/09/10 18:55, Jeff Eastman wrote:
If all of the representative points for that cluster are identical then they are also identical to the cluster center (the first representative point) and should be pruned. I'm wondering why this was not detected in invalidCluster; can you investigate that? You may also want to plug in an instance of the new OnlineGaussianAccumulator to see if it does any better. It is likely to be much more stable than the RunningSums...
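
An online accumulator of the kind mentioned here can be sketched with Welford's algorithm. This is a minimal illustration under stated assumptions, not the code of Mahout's OnlineGaussianAccumulator: the class name WelfordAccumulator is hypothetical, and it tracks a single scalar term rather than a Vector. It updates the mean and the sum of squared deviations incrementally, so it never subtracts two large, nearly equal sums.

```java
/** Hypothetical scalar sketch of an online (Welford) mean/std accumulator. */
final class WelfordAccumulator {
  private long n;       // number of observations so far
  private double mean;  // running mean
  private double m2;    // running sum of squared deviations from the mean

  /** Folds one observation into the running statistics. */
  void observe(double x) {
    n++;
    double delta = x - mean;   // deviation from the old mean
    mean += delta / n;         // updated mean
    m2 += delta * (x - mean);  // uses both the old and the new deviation
  }

  double mean() {
    return mean;
  }

  /** Population standard deviation; 0.0 when nothing has been observed. */
  double std() {
    return n == 0 ? 0.0 : Math.sqrt(m2 / n);
  }
}
```

For a cluster of identical points every delta after the first observation is zero, so the accumulated sum stays at zero and the std is 0.0 rather than NaN.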

On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
Thanks for that Jeff. I tried the changes and get the same result as expected. FYI I've investigated further and it seems that all of the points in the affected cluster are identical, so it ends up as more or less the same problem we had last week with clusters with total points < # representative points, in that there are duplicate representative points. In this case total > # representative, but the end result is the same.

I'm wondering if the quickest and easiest solution is to simply ignore such clusters, i.e. those that currently generate a NaN std? I'm not sure if it's the "correct" approach though...



On 29/09/10 17:37, Jeff Eastman wrote:
  Hi Derek,
I've committed some changes which will hopefully help in fixing this problem but which do not yet accomplish that. As you can see from the new CDbw test (testAlmostSameValueCluster) I tried creating a test cluster with points identical to the cluster center but with one which differed from it by Double.MIN_NORMAL in one element. That test failed to duplicate your issue.

The patch also factors out the std calculation into an implementor of GaussianAccumulator. I factored the current std calculations out of CDbwEvaluator into RunningSumsGaussianAccumulator and all the tests produced the same results as before. With the new OnlineGaussianAccumulator plugged in, the tests all return slightly different results but still no NaNs.

I still have not added priors and I'm not entirely sure where to do that. I've committed the changes so you can see my quandary. OnlineGaussianAccumulator is still a work in progress but, since it is never used, it is in the commit for your viewing.

Jeff

On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
Thanks Jeff, I'll try out the changes when they're committed. I tried a couple of things locally (removing the clusters/setting a small prior), but I ended up with inter-density > intra-density, so I suspect I've slipped up somewhere. I'll hold off on it for now.

On 29/09/10 13:48, Jeff Eastman wrote:
  Hi Derek,

That makes sense. With the very, very tight cluster that your clustering produced you've uncovered an instability in that std calculation. I'm going to rework that method today to use a better algorithm and will add a small prior in the process. I'm also going to add a unit test to reproduce this problem first. Look for a commit in a couple of hours.



On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
Hi Jeff,
FYI I checked the problem I was having in CDbwEvaluator with the same dataset from the ClusterEvaluator thread; the problem is occurring in the std calculation in CDbwEvaluator.computeStd(), in that s2.times(s0).minus(s1.times(s1)) generates negative values which then produce NaN with the subsequent SquareRootFunction(). This then sets the average std to NaN later on in intraClusterDensity(). It's happening for the cluster I have with the almost-identical points.

It's the same symptom as the problem last week, where this was happening when s0 was 1. Is the solution to ignore these clusters, like the s0 = 1 clusters? Or to add a small prior std as was done for the similar issue in NormalModel.pdf()?
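
The failure mode and the "ignore these clusters" option can be sketched for a single scalar term as follows. This is only an illustration: the class StdGuards and its methods are hypothetical and are not part of CDbwEvaluator, and clamping the difference to zero is an extra alternative that the thread does not propose.

```java
/** Hypothetical helpers illustrating the NaN problem for one scalar term. */
final class StdGuards {
  private StdGuards() {}

  /** Running-sums std as written: NaN whenever s2*s0 - s1*s1 is negative. */
  static double rawStd(double s0, double s1, double s2) {
    return Math.sqrt(s2 * s0 - s1 * s1) / s0;
  }

  /** Assumed alternative: treat a negative difference as rounding error and clamp it to zero. */
  static double clampedStd(double s0, double s1, double s2) {
    return Math.sqrt(Math.max(0.0, s2 * s0 - s1 * s1)) / s0;
  }

  /** Averages per-cluster stds while skipping clusters whose std is NaN. */
  static double averageIgnoringNaN(double[] stds) {
    double sum = 0.0;
    int count = 0;
    for (double s : stds) {
      if (!Double.isNaN(s)) {
        sum += s;
        count++;
      }
    }
    return count == 0 ? 0.0 : sum / count;
  }
}
```

With almost-identical points, s2*s0 and s1*s1 agree in nearly all of their digits, so rounding can push their difference slightly below zero even though the true variance is non-negative.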

Thanks,

Derek

On 28/09/10 20:28, Jeff Eastman wrote:
  Hi Ted,

The clustering code computes this value for cluster radius. Currently, it is done with a running sums approach (s^0, s^1, s^2) that computes the std of each vector term using:

Vector std = s2.times(s0).minus(s1.times(s1)).assign(new SquareRootFunction()).divide(s0);

For CDbw, they need a scalar, average std value, and this is
currently computed by averaging the vector terms:

double d = std.zSum() / std.size();

The more I read about it, however, the less confident I am about this approach. The paper itself seems to indicate a covariance approach, but I am lost in their notation. See page 5, just above Definition 1.
www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
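
The running-sums calculation quoted above can be sketched with plain arrays in place of Mahout Vectors. This is a minimal illustration, assuming s0 is the point count, s1 the per-term sum, and s2 the per-term sum of squares; the class RunningSumsStd and its method names are hypothetical.

```java
/** Hypothetical array-based sketch of the running-sums std and its scalar average. */
final class RunningSumsStd {
  private RunningSumsStd() {}

  /** Per-term std computed as sqrt(s2*s0 - s1*s1) / s0. */
  static double[] std(double[][] points) {
    int dims = points[0].length;
    double s0 = points.length;       // count of points
    double[] s1 = new double[dims];  // per-term sum of x
    double[] s2 = new double[dims];  // per-term sum of x^2
    for (double[] p : points) {
      for (int i = 0; i < dims; i++) {
        s1[i] += p[i];
        s2[i] += p[i] * p[i];
      }
    }
    double[] std = new double[dims];
    for (int i = 0; i < dims; i++) {
      // Mirrors s2.times(s0).minus(s1.times(s1)), then sqrt, then divide(s0).
      std[i] = Math.sqrt(s2[i] * s0 - s1[i] * s1[i]) / s0;
    }
    return std;
  }

  /** Scalar average of the per-term stds, as in std.zSum() / std.size(). */
  static double averageStd(double[] std) {
    double sum = 0.0;
    for (double v : std) {
      sum += v;
    }
    return sum / std.length;
  }
}
```

Algebraically this equals the population standard deviation of each term, sqrt(s2/s0 - (s1/s0)^2), which is why it depends on subtracting two quantities that can be almost equal.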
