If all of the representative points for that cluster are identical
then they are also identical to the cluster center (the first
representative point) and should be pruned. I'm wondering why this was
not detected in invalidCluster; can you investigate that? You may also
want to plug in an instance of the new OnlineGaussianAccumulator to see
if it does any better. It is likely to be much more stable than the
RunningSums...
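For reference, here's a rough sketch of the kind of prune check I mean
(the map layout and method are illustrative only, not committed code,
and the exact-equality test via getDistanceSquared() could be loosened
to a small tolerance if needed):

    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;
    import org.apache.mahout.math.Vector;

    public class RepresentativePointPruner {
      // Drop any cluster whose representative points are all identical
      // to the cluster center (the first representative point).
      public static void prune(Map<Integer, List<Vector>> repPoints) {
        Iterator<List<Vector>> it = repPoints.values().iterator();
        while (it.hasNext()) {
          List<Vector> reps = it.next();
          Vector center = reps.get(0);
          boolean allIdentical = true;
          for (Vector rep : reps) {
            if (rep.getDistanceSquared(center) > 0) {
              allIdentical = false;
              break;
            }
          }
          if (allIdentical) {
            it.remove();  // degenerate cluster: prune it
          }
        }
      }
    }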
On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
Thanks for that, Jeff. I tried the changes and got the same result, as
expected. FYI I've investigated further and it seems that all of the
points in the affected cluster are identical, so it ends up being more
or less the same problem we had last week with clusters whose total
points < # representative points, in that there are duplicate
representative points. In this case total > # representative points,
but the end result is the same.
I'm wondering if the quickest and easiest solution is to simply ignore
such clusters, i.e. those that currently generate a NaN std? I'm not
sure if it's the "correct" approach though...
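For what it's worth, all I have in mind is a guard along these lines
(isUsable() is just an illustrative name, not existing CDbwEvaluator
code):

    // Skip any cluster whose std came out NaN so it can't poison the
    // intra-cluster density average.
    static boolean isUsable(double clusterStd) {
      return !Double.isNaN(clusterStd);
    }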
On 29/09/10 17:37, Jeff Eastman wrote:
Hi Derek,
I've committed some changes which will hopefully help in fixing this
problem but which do not yet accomplish that. As you can see from the
new CDbw test (testAlmostSameValueCluster), I tried creating a test
cluster with points identical to the cluster center but with one that
differed from it by Double.MIN_NORMAL in one element. That test
failed to duplicate your issue.
The patch also factors out the std calculation into an implementor of
GaussianAccumulator. I factored the current std calculations out of
CDbwEvaluator into RunningSumsGaussianAccumulator and all the tests
produced the same results as before. With the new
OnlineGaussianAccumulator plugged in, the tests all return slightly
different results but still no NaNs.
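Roughly, the accumulator contract amounts to something like this
(signatures here are a sketch of the idea, not necessarily the
committed interface):

    import org.apache.mahout.math.Vector;

    // Sketch of the accumulator contract; names/signatures may differ
    // from the committed code.
    public interface GaussianAccumulator {
      void observe(Vector x, double weight); // fold in one observation
      void compute();                        // finalize the statistics
      Vector getMean();
      Vector getStd();                       // per-term std
      double getAverageStd();                // the scalar std CDbw needs
    }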
I still have not added priors and I'm not entirely sure where to do
that. I've committed the changes so you can see my quandary.
OnlineGaussianAccumulator is still a work in progress but, since it
is never used, it is in the commit just for your viewing.
Jeff
On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
Thanks Jeff, I'll try out the changes when they're committed. I
tried a couple of things locally (removing the clusters/setting a
small prior), but I ended up with inter-density > intra-density, so
I suspect I've slipped up somewhere. I'll hold off on it for now.
On 29/09/10 13:48, Jeff Eastman wrote:
Hi Derek,
That makes sense. With the very, very tight cluster that your
clustering produced, you've uncovered an instability in that std
calculation. I'm going to rework that method today to use a better
algorithm and will add a small prior in the process. I'm also going
to add a unit test to reproduce this problem first. Look for a
commit in a couple of hours.
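One standard candidate for that better algorithm is Welford's online
update, which never forms the big cancelling products; here is a
scalar sketch of the idea (not the actual accumulator code):

    // Welford's online mean/variance update, scalar form for clarity.
    public class OnlineStd {
      private double n;
      private double mean;
      private double m2; // running sum of squared deviations from mean

      public void observe(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // updated mean; stays >= 0 up to rounding
      }

      public double std(double prior) {
        // m2 / n is the population variance; a small prior keeps the
        // result strictly positive for a cluster of identical points.
        return n == 0 ? Math.sqrt(prior) : Math.sqrt(m2 / n + prior);
      }
    }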
On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
Hi Jeff,
FYI I checked the problem I was having in CDbwEvaluator with the
same dataset from the ClusterEvaluator thread; the problem is
occurring in the std calculation in CDbwEvaluator.computeStd(), in
that s2.times(s0).minus(s1.times(s1)) generates negative values,
which then produce NaN with the subsequent SquareRootFunction().
This then sets the average std to NaN later on in
intraClusterDensity(). It's happening for the cluster I have with
the almost-identical points.
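The effect is easy to reproduce outside Mahout with plain doubles;
this toy program (mine, not from the codebase) just prints whatever
sign the rounding happens to produce:

    public class CancellationDemo {
      public static void main(String[] args) {
        double x = 1.0 / 3.0; // any value with no exact binary form
        double s0 = 0, s1 = 0, s2 = 0;
        for (int i = 0; i < 1000; i++) { // a cluster of identical points
          s0 += 1;
          s1 += x;
          s2 += x * x;
        }
        double radicand = s2 * s0 - s1 * s1; // mathematically zero here
        System.out.println("radicand = " + radicand);
        // If rounding pushes the radicand below zero, the std is NaN:
        System.out.println("std = " + (Math.sqrt(radicand) / s0));
      }
    }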
It's the same symptom as the problem last week, where this was
happening when s0 was 1. Is the solution to ignore these clusters,
like the s0 = 1 clusters? Or to add a small prior std as was done
for the similar issue in NormalModel.pdf()?
Thanks,
Derek
On 28/09/10 20:28, Jeff Eastman wrote:
Hi Ted,
The clustering code computes this value for cluster radius.
Currently, it is done with a running sums approach (s^0, s^1,
s^2) that computes the std of each vector term using:
Vector std = s2.times(s0).minus(s1.times(s1)).assign(new
SquareRootFunction()).divide(s0);
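Per vector term j this is just the textbook sum-of-squares identity
(restating it here for clarity):

    std_j = sqrt(s0*s2_j - s1_j^2) / s0
          = sqrt( s2_j/s0 - (s1_j/s0)^2 )
          = sqrt( mean(x_j^2) - mean(x_j)^2 )

which is the form known to be prone to cancellation when the values
are nearly identical.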
For CDbw, they need a scalar average std value, and this is
currently computed by averaging the vector terms:
double d = std.zSum() / std.size();
The more I read about it, however, the less confident I am about
this approach. The paper itself seems to indicate a covariance
approach, but I am lost in their notation. See page 5, just above
Definition 1.
www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf