[ 
https://issues.apache.org/jira/browse/MAHOUT-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Eastman updated MAHOUT-533:
--------------------------------

    Attachment: MAHOUT-533.patch

Here's a patch that implements a weighted Welford online accumulator and uses 
AbstractCluster's weighted RunningSums, which originally came from Fuzzy 
K-Means. The test computes gold-standard values for mean and std using the 
two-pass method and compares each accumulator against them. As expected, the 
Online accumulator does much better than RunningSums at computing std. The 
tests then compare two weighted observations using both accumulators. In both 
tests, the mean values produced are within EPSILON of each other but the std 
values differ by 0.001.
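
For readers who want to see the shape of the comparison, here is a minimal 
scalar sketch. It is written in Python for brevity and is not the patch's Java 
code: the class and function names are hypothetical, the accumulators work on 
single values rather than Mahout Vectors, and the population form of the 
variance (dividing by the total weight) is an assumption.

```python
import math


class WeightedOnlineAccumulator:
    """Weighted Welford-style update (hypothetical name, scalar only)."""

    def __init__(self):
        self.w_sum = 0.0  # total weight observed so far
        self.mean = 0.0   # running weighted mean
        self.m2 = 0.0     # weighted sum of squared deviations from the mean

    def observe(self, x, weight=1.0):
        if weight <= 0.0:
            return  # ignore observations that carry no weight
        self.w_sum += weight
        delta = x - self.mean
        self.mean += (weight / self.w_sum) * delta
        # Uses the deviation from the old mean and from the updated mean.
        self.m2 += weight * delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / self.w_sum) if self.w_sum > 0.0 else 0.0


class WeightedRunningSums:
    """Sum-of-squares accumulator: s0 = sum(w), s1 = sum(w*x), s2 = sum(w*x*x)."""

    def __init__(self):
        self.s0 = 0.0
        self.s1 = 0.0
        self.s2 = 0.0

    def observe(self, x, weight=1.0):
        self.s0 += weight
        self.s1 += weight * x
        self.s2 += weight * x * x

    def mean(self):
        return self.s1 / self.s0

    def std(self):
        # Subtracting two large, nearly equal numbers is where precision is lost.
        var = (self.s2 * self.s0 - self.s1 * self.s1) / (self.s0 * self.s0)
        return math.sqrt(max(var, 0.0))


def two_pass(xs, ws):
    """Gold standard: compute the mean first, then the deviations from it."""
    w_sum = sum(ws)
    mean = sum(w * x for x, w in zip(xs, ws)) / w_sum
    var = sum(w * (x - mean) ** 2 for x, w in zip(xs, ws)) / w_sum
    return mean, math.sqrt(var)


if __name__ == "__main__":
    # A large common offset makes the sum-of-squares cancellation visible.
    xs = [1e8 + d for d in (4.0, 7.0, 13.0, 16.0)]
    ws = [1.0, 2.0, 1.0, 2.0]
    online, sums = WeightedOnlineAccumulator(), WeightedRunningSums()
    for x, w in zip(xs, ws):
        online.observe(x, w)
        sums.observe(x, w)
    _, gold_std = two_pass(xs, ws)
    print("online std error:      ", abs(online.std() - gold_std))
    print("running-sums std error:", abs(sums.std() - gold_std))
```

The sketch also shows the cost trade-off: the online version does a division 
and several multiplications on every observe(), while the running-sums version 
only adds and defers the rest to std().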

> Clustering Standard Deviation Calculations Are Inaccurate
> ---------------------------------------------------------
>
>                 Key: MAHOUT-533
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-533
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Jeff Eastman
>             Fix For: 0.5
>
>         Attachments: MAHOUT-533.patch
>
>
> Mahout has two classes that compute Gaussian statistics: 
> RunningSumsGaussianAccumulator and OnlineGaussianAccumulator. The first uses 
> sum-of-squares and the second Welford's method. There is also a unit test 
> (TestGaussianAccumulators) which compares their results over a sample dataset 
> and illustrates the large differences in standard deviation produced. The 
> Online accumulator is used in the CDbwEvaluator to compute its metrics. The 
> RunningSums accumulator is only used by the unit test for comparison purposes.
> Today, the sum-of-squares method is used in AbstractCluster to compute mean 
> and stdDev statistics for all Clusters. The stdDev values are not used by 
> most of the clustering algorithms except for graphical displays so this does 
> not cause an accuracy problem with the clustering results themselves. For 
> Dirichlet process clustering, however, stdDev is relevant in computing pdf() 
> and so it needs to be changed in those models. Even with this numerical 
> error, however, Dirichlet performs pretty well. This is probably due to its 
> sampling behavior not requiring precise standard deviations.
> Moving the AbstractCluster implementation to use OnlineGaussianAccumulator is 
> in my plans for 0.5. However, Fuzzy K-Means requires that weighted 
> observations be correctly handled, and both K-Means algorithms require that 
> observation statistics (see ClusterObservations) be passed from mapper to 
> combiner to reducer. I have not yet been able to figure out how to do this 
> with Online's state variables as opposed to RunningSums. There is also a 
> performance difference between the two algorithms: Online does a 
> complete computation for each observe() and none in compute(), whereas 
> RunningSums has minimal per-observation math and does all the heavy lifting 
> in compute().
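
On the mapper-to-combiner-to-reducer question in the description, one known 
option is the pairwise combination formula of Chan, Golub and LeVeque, which 
merges two partial (weight, mean, sum of squared deviations) states without 
revisiting the data. The sketch below is only an illustration of that formula 
in Python, not what the issue or the patch proposes: the OnlineState class is 
hypothetical, it is scalar rather than Vector-based, and the population form 
of the variance is assumed.

```python
import math
from dataclasses import dataclass


@dataclass
class OnlineState:
    """Partial statistics that could be shipped between stages (hypothetical)."""
    w_sum: float = 0.0  # total weight
    mean: float = 0.0   # weighted mean
    m2: float = 0.0     # weighted sum of squared deviations from the mean

    def observe(self, x, weight=1.0):
        """Weighted Welford-style update for a single observation."""
        if weight <= 0.0:
            return
        self.w_sum += weight
        delta = x - self.mean
        self.mean += (weight / self.w_sum) * delta
        self.m2 += weight * delta * (x - self.mean)

    def merge(self, other):
        """Combine two partial states into a new one (pairwise formula)."""
        if other.w_sum == 0.0:
            return OnlineState(self.w_sum, self.mean, self.m2)
        if self.w_sum == 0.0:
            return OnlineState(other.w_sum, other.mean, other.m2)
        w = self.w_sum + other.w_sum
        delta = other.mean - self.mean
        mean = self.mean + delta * other.w_sum / w
        # The cross term accounts for the gap between the two partial means.
        m2 = self.m2 + other.m2 + delta * delta * self.w_sum * other.w_sum / w
        return OnlineState(w, mean, m2)

    def std(self):
        return math.sqrt(self.m2 / self.w_sum) if self.w_sum > 0.0 else 0.0


if __name__ == "__main__":
    # Two "mappers" each see part of the data; a "reducer" merges their states.
    part_a, part_b = OnlineState(), OnlineState()
    for x, w in [(4.0, 1.0), (7.0, 2.0)]:
        part_a.observe(x, w)
    for x, w in [(13.0, 1.0), (16.0, 2.0)]:
        part_b.observe(x, w)
    merged = part_a.merge(part_b)
    print(merged.mean, merged.std())
```

Because merge() returns a state of the same shape as its inputs, the same 
operation could serve in both a combiner and a reducer under these assumptions.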

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.