That slight difference case is exactly where the running sums approach fails. I think that you found the problem.
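To make the failure mode concrete, here is a rough standalone sketch of the cancellation (plain Java; the class name and point values are made up for illustration, this is not the actual CDbwEvaluator or RunningSumsGaussianAccumulator code):

// Hypothetical sketch of why sqrt(s0*s2 - s1*s1)/s0 is fragile when all
// the points in a cluster are nearly identical.
public class RunningSumsDemo {

  public static void main(String[] args) {
    // A "cluster" whose points differ only in the last few significant digits.
    double[] points = {100000000.0, 100000000.0, 100000000.00000001};

    double s0 = 0.0; // count of points
    double s1 = 0.0; // sum of values
    double s2 = 0.0; // sum of squared values
    for (double x : points) {
      s0++;
      s1 += x;
      s2 += x * x;
    }

    // Running-sums form: s0*s2 and s1*s1 agree to ~16 significant digits, so
    // their difference is dominated by rounding error and can even come out
    // negative, in which case the sqrt produces the NaN you're seeing.
    double diff = s0 * s2 - s1 * s1;
    System.out.println("s0*s2 - s1*s1 = " + diff);
    System.out.println("running-sums std = " + Math.sqrt(diff) / s0);

    // Two-pass form: subtract the mean before squaring, so nothing cancels.
    double mean = s1 / s0;
    double ss = 0.0;
    for (double x : points) {
      ss += (x - mean) * (x - mean);
    }
    System.out.println("two-pass std = " + Math.sqrt(ss / s0));
  }
}

The two-pass version has nothing left to cancel, which is the general reason I expect an online formulation to behave better than the raw sums.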
Your suspicion about inter being larger than intra confuses me. Isn't that just another way of saying that clusters are very tight?

On Wed, Sep 29, 2010 at 11:19 AM, Derek O'Callaghan <[email protected]> wrote:

> I just stepped through invalidCluster(), and it seems that there's a slight difference between the centre and the other points, so it returns false. I was positive that there was no difference when I stepped through it last, I must have overlooked something, sorry about that.
>
> I just tried the OnlineGaussianAccumulator and it does run better, in that I get values for the 4 metrics. One thing I need to check is why the inter-density is so much bigger than the intra-, I'm getting the following values:
>
> CDbw = 68.51761788802385
> Intra-cluster density = 0.3734803797950363
> Inter-cluster density = 3.474415557178534
> Separation = 183.4570745741071
>
> When using RunningSums and ignoring the identical points cluster, I get a similar issue in that inter = ~1.5, with intra = ~0.15. I have to leave for the evening, I'll look into it tomorrow to see if I can determine if it's correct.
>
> Thanks again.
>
> On 29/09/10 18:55, Jeff Eastman wrote:
>
>> If all of the representative points for that cluster are identical then they are also identical to the cluster center (the first representative point) and should be pruned. I'm wondering why this was not detected in invalidCluster, can you investigate that? You may also want to plug in an instance of the new OnlineGaussianAccumulator to see if it does any better. It is likely to be much more stable than the RunningSums...
>>
>> On 9/29/10 1:45 PM, Derek O'Callaghan wrote:
>>
>>> Thanks for that Jeff. I tried the changes and get the same result as expected. FYI I've investigated further and it seems that all of the points in the affected cluster are identical, so it ends up as more or less the same problem we had last week with clusters with total points < # representative points, in that there are duplicate representative points. In this case total > # representative, but the end result is the same.
>>>
>>> I'm wondering if the quickest and easiest solution is to simply ignore such clusters, i.e. those that currently generate a NaN std? I'm not sure if it's the "correct" approach though...
>>>
>>> On 29/09/10 17:37, Jeff Eastman wrote:
>>>
>>>> Hi Derek,
>>>>
>>>> I've committed some changes which will hopefully help in fixing this problem but which do not yet accomplish that. As you can see from the new CDbw test (testAlmostSameValueCluster) I tried creating a test cluster with points identical to the cluster center but with one which differed from it by Double.MIN_NORMAL in one element. That test failed to duplicate your issue.
>>>>
>>>> The patch also factors out the std calculation into an implementor of GaussianAccumulator. I factored the current std calculations out of CDbwEvaluator into RunningSumsGaussianAccumulator and all the tests produced the same results as before. With the new OnlineGaussianAccumulator plugged in, the tests all return slightly different results but still no NaNs.
>>>>
>>>> I still have not added priors and I'm not entirely sure where to do that. I've committed the changes so you can see my quandary. OnlineGaussianAccumulator is still a work in progress but, since it is never used, it is in the commit for your viewing.
>>>>
>>>> Jeff
>>>>
>>>> On 9/29/10 11:13 AM, Derek O'Callaghan wrote:
>>>>
>>>>> Thanks Jeff, I'll try out the changes when they're committed. I tried a couple of things locally (removing the clusters/setting a small prior), but I ended up with inter-density > intra-density, so I suspect I've slipped up somewhere. I'll hold off on it for now.
>>>>>
>>>>> On 29/09/10 13:48, Jeff Eastman wrote:
>>>>>
>>>>>> Hi Derek,
>>>>>>
>>>>>> That makes sense. With the very, very tight cluster that your clustering produced you've uncovered an instability in that std calculation. I'm going to rework that method today to use a better algorithm and will add a small prior in the process. I'm also going to add a unit test to reproduce this problem first. Look for a commit in a couple of hours.
>>>>>>
>>>>>> On 9/29/10 8:02 AM, Derek O'Callaghan wrote:
>>>>>>
>>>>>>> Hi Jeff,
>>>>>>>
>>>>>>> FYI I checked the problem I was having in CDbwEvaluator with the same dataset from the ClusterEvaluator thread. The problem is occurring in the std calculation in CDbwEvaluator.computeStd(), in that s2.times(s0).minus(s1.times(s1)) generates negative values which then produce NaN with the subsequent SquareRootFunction(). This then sets the average std to NaN later on in intraClusterDensity(). It's happening for the cluster I have with the almost-identical points.
>>>>>>>
>>>>>>> It's the same symptom as the problem last week, where this was happening when s0 was 1. Is the solution to ignore these clusters, like the s0 = 1 clusters? Or to add a small prior std as was done for the similar issue in NormalModel.pdf()?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Derek
>>>>>>>
>>>>>>> On 28/09/10 20:28, Jeff Eastman wrote:
>>>>>>>
>>>>>>>> Hi Ted,
>>>>>>>>
>>>>>>>> The clustering code computes this value for cluster radius. Currently, it is done with a running sums approach (s^0, s^1, s^2) that computes the std of each vector term using:
>>>>>>>>
>>>>>>>> Vector std = s2.times(s0).minus(s1.times(s1)).assign(new SquareRootFunction()).divide(s0);
>>>>>>>>
>>>>>>>> For CDbw, they need a scalar, average std value, and this is currently computed by averaging the vector terms:
>>>>>>>>
>>>>>>>> double d = std.zSum() / std.size();
>>>>>>>>
>>>>>>>> The more I read about it, however, the less confident I am about this approach. The paper itself seems to indicate a covariance approach, but I am lost in their notation. See page 5, just above Definition 1.
>>>>>>>>
>>>>>>>> www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
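For reference, below is a scalar Welford-style sketch of the kind of incremental update that sidesteps the cancellation entirely. It is illustrative only (the class name is made up, and the real OnlineGaussianAccumulator works on Mahout Vectors and is still a work in progress), but it shows why a one-pass accumulator never has to take the square root of a negative number:

// Illustrative scalar sketch of a Welford-style online mean/variance update.
// Not the committed OnlineGaussianAccumulator; shown only to contrast with
// the s0/s1/s2 running sums.
public class OnlineStdSketch {

  private double n = 0.0;     // number of observations so far
  private double mean = 0.0;  // running mean
  private double m2 = 0.0;    // running sum of squared deviations from the mean

  public void observe(double x) {
    n++;
    double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean); // uses the updated mean; each term is >= 0
  }

  public double getStd() {
    // Population std, matching the divide-by-s0 form in computeStd();
    // m2 accumulates non-negative terms, so the sqrt cannot produce NaN.
    return n > 0 ? Math.sqrt(m2 / n) : 0.0;
  }

  public static void main(String[] args) {
    OnlineStdSketch acc = new OnlineStdSketch();
    for (double x : new double[] {100000000.0, 100000000.0, 100000000.00000001}) {
      acc.observe(x);
    }
    System.out.println("online std = " + acc.getStd());
  }
}

A small prior std could still be folded in on top of m2 for degenerate clusters, which is the open question about where the priors belong.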
