Re: Standard Deviation of a Set of Vectors

Derek O'Callaghan Wed, 29 Sep 2010 11:19:33 -0700

I just stepped through invalidCluster(), and it seems that there's a slight difference between the centre and the other points, so it returns false. I was positive that there was no difference when I stepped through it last, I must have overlooked something, sorry about that.

I just tried the OnlineGaussianAccumulator and it does run better, in that I get values for the 4 metrics. One thing I need to check is why the inter-density is so much bigger than the intra-density. I'm getting the following values:


CDbw = 68.51761788802385
Intra-cluster density = 0.3734803797950363
Inter-cluster density = 3.474415557178534
Separation = 183.4570745741071

When using RunningSums and ignoring the identical points cluster, I get a similar issue in that inter = ~1.5, with intra = ~0.15. I have to leave for the evening, I'll look into it tomorrow to see if I can determine if it's correct.


Thanks again.

On 29/09/10 18:55, Jeff Eastman wrote:

If all of the representative points for that cluster are identical then they are also identical to the cluster center (the first representative point) and should be pruned. I'm wondering why this was not detected in invalidCluster, can you investigate that? You may also want to plug in an instance of the new OnlineGaussianAccumulator to see if it does any better. It is likely to be much more stable than the RunningSums...
On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
Thanks for that Jeff. I tried the changes and get the same result as expected. FYI I've investigated further and it seems that all of the points in the affected cluster are identical, so it ends up as more or less the same problem we had last week with clusters with total points < # representative points, in that there are duplicate representative points. In this case total > # representative, but the end result is the same.
I'm wondering if the quickest and easiest solution is to simply ignore such clusters, i.e. those that currently generate a NaN std? I'm not sure if it's the "correct" approach though...
On 29/09/10 17:37, Jeff Eastman wrote:
 Hi Derek,
I've committed some changes which will hopefully help in fixing this problem but which do not yet accomplish that. As you can see from the new CDbw test (testAlmostSameValueCluster) I tried creating a test cluster with points identical to the cluster center but with one which differed from it by Double.MIN_NORMAL in one element. That test failed to duplicate your issue.
The patch also factors out the std calculation into an implementor of GaussianAccumulator. I factored the current std calculations out of CDbwEvaluator into RunningSumsGaussianAccumulator and all the tests produced the same results as before. With the new OnlineGaussianAccumulator plugged in, the tests all return slightly different results but still no NaNs.
I still have not added priors and I'm not entirely sure where to do that. I've committed the changes so you can see my quandary. OnlineGaussianAccumulator is still a work in progress but, since it is never used, it is in the commit for your viewing.
Jeff
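
For readers following the thread, the message above contrasts a running-sums accumulator with an online one. Below is a minimal sketch of a Welford-style online update, which is one common way such an accumulator can be written. It is an assumption for illustration only and not the code in Mahout's OnlineGaussianAccumulator: the class name WelfordSketch and its methods are hypothetical, and it tracks a single scalar component rather than a Vector.

```java
// Hypothetical illustration; not the Mahout OnlineGaussianAccumulator.
final class WelfordSketch {
  private long n;       // number of observations so far
  private double mean;  // running mean
  private double m2;    // running sum of squared deviations from the mean

  // Fold one observation into the running mean and squared deviations.
  void observe(double x) {
    n++;
    double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean);  // second factor uses the updated mean
  }

  double mean() {
    return mean;
  }

  // Population standard deviation; 0.0 when nothing has been observed.
  double std() {
    return n == 0 ? 0.0 : Math.sqrt(m2 / n);
  }

  public static void main(String[] args) {
    WelfordSketch acc = new WelfordSketch();
    for (double x : new double[] {1, 2, 3, 4}) {
      acc.observe(x);
    }
    System.out.println(acc.mean() + " " + acc.std());
  }
}
```

Because m2 is accumulated from deviations rather than from the difference of two large sums, a run of identical observations leaves it at exactly zero, so the square root never sees a negative argument in that case.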

On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
Thanks Jeff, I'll try out the changes when they're committed. I tried a couple of things locally (removing the clusters/setting a small prior), but I ended up with inter-density > intra-density, so I suspect I've slipped up somewhere. I'll hold off on it for now.
On 29/09/10 13:48, Jeff Eastman wrote:
 Hi Derek,
That makes sense. With the very, very tight cluster that your clustering produced you've uncovered an instability in that std calculation. I'm going to rework that method today to use a better algorithm and will add a small prior in the process. I'm also going to add a unit test to reproduce this problem first. Look for a commit in a couple of hours.
On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
Hi Jeff,
FYI I checked the problem I was having in CDbwEvaluator with the same dataset from the ClusterEvaluator thread, the problem is occurring in the std calculation in CDbwEvaluator.computeStd(), in that s2.times(s0).minus(s1.times(s1)) generates negative values which then produce NaN with the subsequent SquareRootFunction(). This then sets the average std to NaN later on in intraClusterDensity(). It's happening for the cluster I have with the almost-identical points.
It's the same symptom as the problem last week, where this was happening when s0 was 1. Is the solution to ignore these clusters, like the s0 = 1 clusters? Or to add a small prior std as was done for the similar issue in NormalModel.pdf()?
Thanks,

Derek
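
The failure described in the message above can be reproduced with scalars. The sketch below is a hedged illustration, not Mahout code: the class RunningSumsSketch, its methods, and the sample value 134217731.0 (2^27 + 3) are all assumptions chosen so that rounding in double arithmetic pushes s2*s0 - s1*s1 slightly below zero even though the three inputs are identical. clampedStd shows one possible guard, treating a negative radicand as zero; it is neither of the two options raised in the message.

```java
// Hypothetical illustration of the running-sums failure mode; not Mahout code.
final class RunningSumsSketch {
  // Returns {s0, s1, s2}: count, sum, and sum of squares of the values.
  static double[] sums(double[] xs) {
    double s0 = 0;
    double s1 = 0;
    double s2 = 0;
    for (double x : xs) {
      s0 += 1;
      s1 += x;
      s2 += x * x;
    }
    return new double[] {s0, s1, s2};
  }

  // Same formula as in the thread: sqrt(s2*s0 - s1*s1) / s0.
  static double naiveStd(double[] s) {
    return Math.sqrt(s[2] * s[0] - s[1] * s[1]) / s[0];
  }

  // One possible guard (an assumption): treat a negative radicand as zero.
  static double clampedStd(double[] s) {
    double radicand = s[2] * s[0] - s[1] * s[1];
    return Math.sqrt(Math.max(0.0, radicand)) / s[0];
  }

  public static void main(String[] args) {
    double x = 134217731.0;  // 2^27 + 3; its square is not exactly representable
    double[] s = sums(new double[] {x, x, x});
    System.out.println(naiveStd(s));
    System.out.println(clampedStd(s));
  }
}
```

The two products are each rounded separately before the subtraction, so their difference can land on either side of zero when the true variance is zero or very small. Math.sqrt returns NaN for any negative argument, which then propagates through later averages.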

On 28/09/10 20:28, Jeff Eastman wrote:
 Hi Ted,
The clustering code computes this value for cluster radius. Currently, it is done with a running sums approach (s^0, s^1, s^2) that computes the std of each vector term using:
Vector std = s2.times(s0).minus(s1.times(s1)).assign(new SquareRootFunction()).divide(s0);
For CDbw, they need a scalar, average std value, and this is currently computed by averaging the vector terms:
double d = std.zSum() / std.size();
The more I read about it, however, the less confident I am about this approach. The paper itself seems to indicate a covariance approach, but I am lost in their notation. See page 5, just above Definition 1.
www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
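
To make the two expressions in the message above concrete, here is a small sketch of the same arithmetic on plain arrays. It is an illustration under stated assumptions, not the Mahout implementation: the class AverageStdSketch and its methods are hypothetical, double[] stands in for Mahout's Vector, and averageStd mirrors std.zSum() / std.size().

```java
// Hypothetical illustration using plain arrays in place of Mahout's Vector.
final class AverageStdSketch {
  // Per-component std via running sums: sqrt(s2*s0 - s1*s1) / s0.
  static double[] componentStd(double[][] points) {
    int dim = points[0].length;
    double s0 = points.length;
    double[] s1 = new double[dim];
    double[] s2 = new double[dim];
    for (double[] p : points) {
      for (int i = 0; i < dim; i++) {
        s1[i] += p[i];
        s2[i] += p[i] * p[i];
      }
    }
    double[] std = new double[dim];
    for (int i = 0; i < dim; i++) {
      std[i] = Math.sqrt(s2[i] * s0 - s1[i] * s1[i]) / s0;
    }
    return std;
  }

  // Scalar summary: the mean of the per-component stds.
  static double averageStd(double[] std) {
    double sum = 0;
    for (double v : std) {
      sum += v;
    }
    return sum / std.length;
  }

  public static void main(String[] args) {
    double[][] points = {{1, 10}, {3, 10}, {5, 10}};
    System.out.println(averageStd(componentStd(points)));
  }
}
```

Note that averaging the per-component values treats every dimension independently and ignores any correlation between components, which is the gap a covariance-based approach would address.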
