[ 
https://issues.apache.org/jira/browse/MAHOUT-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983118#action_12983118
 ] 

Sean Owen commented on MAHOUT-533:
----------------------------------

The patch applies and passes. The idea seems reasonable to me. Any objection to 
committing?

> Clustering Standard Deviation Calculations Are Inaccurate
> ---------------------------------------------------------
>
>                 Key: MAHOUT-533
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-533
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Jeff Eastman
>             Fix For: 0.5
>
>         Attachments: MAHOUT-533.patch
>
>
> Mahout has two classes that compute Gaussian statistics: 
> RunningSumsGaussianAccumulator and OnlineGaussianAccumulator. The first uses 
> sum-of-squares and the second Welford's method. There is also a unit test 
> (TestGaussianAccumulators) which compares their results over a sample dataset 
> and illustrates the large differences in standard deviation produced. The 
> Online accumulator is used in the CDbwEvaluator to compute its metrics. The 
> RunningSums accumulator is only used by the unit test for comparison purposes.
> Today, the sum-of-squares method is used in AbstractCluster to compute mean 
> and stdDev statistics for all Clusters. Most of the clustering algorithms 
> use the stdDev values only for graphical displays, so this does not cause an 
> accuracy problem with the clustering results themselves. For Dirichlet 
> process clustering, however, stdDev is relevant in computing pdf(), so it 
> needs to be changed in those models. Even with this numerical error, 
> Dirichlet performs pretty well, probably because its sampling behavior does 
> not require precise standard deviations.
> Moving the AbstractCluster implementation to use OnlineGaussianAccumulator is 
> in my plans for 0.5. However, Fuzzy K-Means requires that weighted 
> observations be handled correctly, and both K-Means algorithms require that 
> observation statistics (see ClusterObservations) be passed from mapper to 
> combiner to reducer. I have not yet been able to figure out how to do this 
> with Online's state variables as opposed to RunningSums. There is also a 
> performance difference between the two algorithms: Online does a complete 
> computation for each observe() and none in compute(), whereas RunningSums 
> has minimal per-observation math and does all the heavy lifting in 
> compute().
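
To make the trade-off in the description concrete, here is a minimal Java 
sketch of the two approaches. It is not Mahout's code: the class and method 
names (GaussianSketch, RunningSums, Online, merge) are hypothetical, it 
handles scalars rather than Vectors, and it assumes population variance 
(divide by total weight). The weighted observe() follows the usual weighted 
form of Welford's update, and merge() uses the standard pairwise-combination 
formula for (weight, mean, M2) triples, which is one possible way to pass 
partial statistics from mapper to combiner to reducer.

```java
final class GaussianSketch {

    /** Sum-of-squares state: total weight, weighted sum, weighted sum of squares. */
    static final class RunningSums {
        double s0, s1, s2;

        void observe(double x, double w) {
            s0 += w;
            s1 += w * x;
            s2 += w * x * x;
        }

        /** Cheap to accumulate, but subtracts two large, nearly equal numbers. */
        double variance() {
            return (s2 - s1 * s1 / s0) / s0;
        }
    }

    /** Welford-style state: total weight, running mean, sum of squared deviations. */
    static final class Online {
        double n, mean, m2;

        /** Weighted incremental update; w = 1 gives the unweighted method. */
        void observe(double x, double w) {
            double newN = n + w;
            double delta = x - mean;
            mean += delta * w / newN;
            m2 += w * delta * (x - mean); // uses the already-updated mean
            n = newN;
        }

        /** Folds another partial state into this one (hypothetical combiner step). */
        void merge(Online other) {
            if (other.n == 0) {
                return;
            }
            if (n == 0) {
                n = other.n;
                mean = other.mean;
                m2 = other.m2;
                return;
            }
            double total = n + other.n;
            double delta = other.mean - mean;
            m2 += other.m2 + delta * delta * n * other.n / total;
            mean += delta * other.n / total;
            n = total;
        }

        double variance() {
            return n > 0 ? m2 / n : 0.0;
        }
    }

    public static void main(String[] args) {
        // A large offset makes cancellation in the sum-of-squares form visible.
        double offset = 1e9;
        RunningSums sums = new RunningSums();
        Online online = new Online();
        for (int i = 1; i <= 4; i++) {
            sums.observe(offset + i, 1.0);
            online.observe(offset + i, 1.0);
        }
        System.out.println("running sums variance: " + sums.variance());
        System.out.println("online variance:       " + online.variance());
    }
}
```

In this sketch the merge step needs only the three state variables of each 
partial result, so in principle they could be serialized the same way the 
running sums are today. Whether that fits ClusterObservations as it exists 
in Mahout is not something the sketch establishes.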

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
