[
https://issues.apache.org/jira/browse/MAHOUT-533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983249#action_12983249
]
Ted Dunning commented on MAHOUT-533:
------------------------------------
I thought this had long since been committed. Go for it.
> Clustering Standard Deviation Calculations Are Inaccurate
> ---------------------------------------------------------
>
> Key: MAHOUT-533
> URL: https://issues.apache.org/jira/browse/MAHOUT-533
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.4
> Reporter: Jeff Eastman
> Fix For: 0.5
>
> Attachments: MAHOUT-533.patch
>
>
> Mahout has two classes that compute Gaussian statistics:
> RunningSumsGaussianAccumulator and OnlineGaussianAccumulator. The first uses
> sum-of-squares and the second Welford's method. There is also a unit test
> (TestGaussianAccumulators) which compares their results over a sample dataset
> and illustrates the large differences in standard deviation produced. The
> Online accumulator is used in the CDbwEvaluator to compute its metrics. The
> RunningSums accumulator is only used by the unit test for comparison purposes.
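
To illustrate why the two methods can disagree, here is a minimal scalar sketch in Java. It is not Mahout's code: the class name GaussianSketch and both method names are hypothetical, and Mahout's accumulators work on Vectors rather than single doubles. It only shows the two update rules named above, sum-of-squares and Welford's method, applied to the same samples.

```java
// Hypothetical scalar sketch; Mahout's accumulators operate on Vectors.
final class GaussianSketch {

  /** Sum-of-squares: accumulate s0 = n, s1 = sum(x), s2 = sum(x^2). */
  static double runningSumsStd(double[] xs) {
    double s0 = 0, s1 = 0, s2 = 0;
    for (double x : xs) {
      s0 += 1;
      s1 += x;
      s2 += x * x;
    }
    if (s0 == 0) {
      return 0.0;
    }
    // s0*s2 and s1*s1 are large and nearly equal when the mean is far from
    // zero, so their difference can lose most of its significant digits.
    double variance = (s0 * s2 - s1 * s1) / (s0 * s0);
    return Math.sqrt(Math.max(variance, 0.0)); // clamp rounding noise below zero
  }

  /** Welford: update the mean and the sum of squared deviations per sample. */
  static double welfordStd(double[] xs) {
    long n = 0;
    double mean = 0, m2 = 0;
    for (double x : xs) {
      n++;
      double delta = x - mean;
      mean += delta / n;
      m2 += delta * (x - mean);
    }
    return n == 0 ? 0.0 : Math.sqrt(m2 / n); // population standard deviation
  }

  public static void main(String[] args) {
    // Small spread around a large offset: the hard case for sum-of-squares.
    double[] xs = {1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16};
    System.out.println("running sums: " + runningSumsStd(xs));
    System.out.println("welford:      " + welfordStd(xs));
  }
}
```

The exact population standard deviation of these four samples is sqrt(22.5), roughly 4.743. Welford's method works with deviations from the running mean, so the large offset cancels early; the sum-of-squares form subtracts two values near 1.6e19 and is left with rounding noise.
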
>
> Today, the sum-of-squares method is used in AbstractCluster to compute mean
> and stdDev statistics for all Clusters. Most of the clustering algorithms use
> the stdDev values only for graphical displays, so this does not cause an
> accuracy problem with the clustering results themselves. For Dirichlet
> process clustering, however, stdDev is relevant in computing pdf(), so it
> needs to be changed in those models. Even with this numerical error,
> Dirichlet performs pretty well, probably because its sampling behavior does
> not require precise standard deviations.
>
> Moving the AbstractCluster implementation to use OnlineGaussianAccumulator is
> in my plans for 0.5. However, Fuzzy K-Means requires that weighted
> observations be handled correctly, and both K-Means algorithms require that
> observation statistics (see ClusterObservations) be passed from mapper to
> combiner to reducer. I have not yet been able to figure out how to do this
> with Online's state variables as opposed to those of RunningSums. There is
> also a performance difference between the two algorithms: Online does a
> complete computation for each observe() and none in compute(), whereas
> RunningSums does minimal per-observation math and all the heavy lifting in
> compute().
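
On the open question of passing Welford-style state from mapper to combiner to reducer, one known approach is to carry (total weight, mean, sum of squared deviations) and combine partial states pairwise. The sketch below is a scalar illustration of that idea and makes no claim about how Mahout resolved the issue. The class MergeableGaussian and its fields and methods are hypothetical names; observe() uses the weighted form of Welford's update, and merge() uses the standard pairwise combination formula.

```java
// Hypothetical scalar sketch, not Mahout's OnlineGaussianAccumulator.
final class MergeableGaussian {
  double weightSum; // total weight observed so far
  double mean;      // weighted running mean
  double m2;        // weighted sum of squared deviations from the mean

  /** Weighted Welford update for a single observation. */
  void observe(double x, double weight) {
    if (weight <= 0) {
      return; // assumption: non-positive weights are ignored
    }
    weightSum += weight;
    double delta = x - mean;
    mean += delta * weight / weightSum;
    m2 += weight * delta * (x - mean);
  }

  /** Fold another partial state into this one, as a combiner or reducer might. */
  void merge(MergeableGaussian other) {
    if (other.weightSum == 0) {
      return;
    }
    if (weightSum == 0) {
      weightSum = other.weightSum;
      mean = other.mean;
      m2 = other.m2;
      return;
    }
    double total = weightSum + other.weightSum;
    double delta = other.mean - mean;
    // Cross term accounts for the gap between the two partial means.
    m2 += other.m2 + delta * delta * weightSum * other.weightSum / total;
    mean += delta * other.weightSum / total;
    weightSum = total;
  }

  /** Weighted population standard deviation. */
  double std() {
    return weightSum == 0 ? 0.0 : Math.sqrt(m2 / weightSum);
  }

  public static void main(String[] args) {
    MergeableGaussian a = new MergeableGaussian(); // e.g. one mapper's partial
    a.observe(4, 1);
    a.observe(7, 1);
    MergeableGaussian b = new MergeableGaussian(); // e.g. another mapper's partial
    b.observe(13, 1);
    b.observe(16, 1);
    a.merge(b);
    System.out.println("mean=" + a.mean + " std=" + a.std());
  }
}
```

Only three numbers per dimension need to be serialized between stages, the same count as the s0, s1, s2 sums. Unlike those sums, the states cannot simply be added field by field, so the merge step costs a few extra multiplications.
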