Clustering Standard Deviation Calculations Are Inaccurate
---------------------------------------------------------

                 Key: MAHOUT-533
                 URL: https://issues.apache.org/jira/browse/MAHOUT-533
             Project: Mahout
          Issue Type: Improvement
    Affects Versions: 0.4
            Reporter: Jeff Eastman
             Fix For: 0.5


Mahout has two classes that compute Gaussian statistics: 
RunningSumsGaussianAccumulator and OnlineGaussianAccumulator. The first uses 
sum-of-squares and the second Welford's method. There is also a unit test 
(TestGaussianAccumulators) which compares their results over a sample dataset 
and illustrates the large differences in standard deviation produced. The 
Online accumulator is used in the CDbwEvaluator to compute its metrics. The 
RunningSums accumulator is only used by the unit test for comparison purposes.
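
As a rough illustration of why the two approaches diverge, here is a minimal 
scalar sketch. It is not taken from the Mahout sources: the class and method 
names are hypothetical, and population variance (dividing by n) is assumed in 
both methods so the results are directly comparable.

```java
/** Hypothetical demo class, not part of Mahout. */
final class VarianceComparison {

  /** Sum-of-squares variance: E[x^2] - (E[x])^2, computed from running sums. */
  static double naiveVariance(double[] xs) {
    double sum = 0.0;
    double sumSq = 0.0;
    for (double x : xs) {
      sum += x;
      sumSq += x * x; // loses low-order bits when |x| is large
    }
    double mean = sum / xs.length;
    return sumSq / xs.length - mean * mean; // subtracts two nearly equal numbers
  }

  /** Welford's method: updates the mean and the sum of squared deviations. */
  static double welfordVariance(double[] xs) {
    long n = 0;
    double mean = 0.0;
    double m2 = 0.0; // running sum of squared deviations from the mean
    for (double x : xs) {
      n++;
      double delta = x - mean;
      mean += delta / n;
      m2 += delta * (x - mean); // uses the already-updated mean
    }
    return n > 0 ? m2 / n : 0.0;
  }

  public static void main(String[] args) {
    // Large mean, small spread: the hard case for the sum-of-squares form.
    double[] xs = {1.0e9 + 4, 1.0e9 + 7, 1.0e9 + 13, 1.0e9 + 16};
    System.out.println("sum-of-squares: " + naiveVariance(xs));
    System.out.println("Welford:        " + welfordVariance(xs));
  }
}
```

The true population variance of this sample is 22.5. The sum-of-squares form 
subtracts two values near 1e18, where adjacent doubles are 128 apart, so its 
result cannot resolve a variance of that size.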

Today, the sum-of-squares method is used in AbstractCluster to compute mean and 
stdDev statistics for all Clusters. Most of the clustering algorithms use the 
stdDev values only for graphical displays, so this does not cause an accuracy 
problem with the clustering results themselves. For Dirichlet process 
clustering, however, stdDev is relevant in computing pdf(), so it needs to be 
changed in those models. Even with this numerical error, Dirichlet performs 
pretty well, probably because its sampling behavior does not require precise 
standard deviations.

Moving the AbstractCluster implementation to use OnlineGaussianAccumulator is 
in my plans for 0.5. However, Fuzzy K-Means requires that weighted observations 
be handled correctly, and both K-Means algorithms require that observation 
statistics (see ClusterObservations) be passed from mapper to combiner to 
reducer. I have not yet been able to figure out how to do this with Online's 
state variables as opposed to RunningSums. There is also a performance 
difference between the two algorithms: Online does a complete computation for 
each observation and none in computeParameters(), whereas RunningSums has 
minimal per-observation math and does the heavy lifting in computeParameters().
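
One possible direction, offered only as a sketch and not as the approach Mahout 
adopted: the pairwise combination formula of Chan, Golub and LeVeque lets two 
partial Welford states be merged, and West's weighted update handles weighted 
observations. The class below is hypothetical, works on a single dimension 
(Mahout's accumulators operate on Vectors), and assumes population variance 
and strictly positive weights.

```java
/** Hypothetical helper, not a Mahout class: mergeable weighted running moments. */
final class MergeableMoments {
  double weight; // sum of the observation weights seen so far
  double mean;   // weighted running mean
  double m2;     // weighted sum of squared deviations from the mean

  /** Weighted Welford-style update for one observation x with weight w > 0. */
  void observe(double x, double w) {
    double newWeight = weight + w;
    double delta = x - mean;
    mean += delta * (w / newWeight);
    m2 += w * delta * (x - mean); // uses the already-updated mean
    weight = newWeight;
  }

  /** Folds another partial state into this one, e.g. in a combiner or reducer. */
  void merge(MergeableMoments other) {
    if (other.weight == 0.0) {
      return; // nothing to add
    }
    double newWeight = weight + other.weight;
    double delta = other.mean - mean;
    mean += delta * (other.weight / newWeight);
    m2 += other.m2 + delta * delta * (weight * other.weight / newWeight);
    weight = newWeight;
  }

  /** Population variance of everything observed or merged so far. */
  double variance() {
    return weight > 0.0 ? m2 / weight : 0.0;
  }

  public static void main(String[] args) {
    // Two "mappers" each see half of the data; a "reducer" merges the states.
    MergeableMoments a = new MergeableMoments();
    a.observe(1.0e9 + 4, 1.0);
    a.observe(1.0e9 + 7, 1.0);
    MergeableMoments b = new MergeableMoments();
    b.observe(1.0e9 + 13, 1.0);
    b.observe(1.0e9 + 16, 1.0);
    a.merge(b);
    System.out.println("merged variance: " + a.variance());
  }
}
```

Under these assumptions the state passed between stages is just the triple 
(weight, mean, m2) per dimension, the same size as the three running sums. The 
per-observation cost is still higher than accumulating plain sums, which is the 
performance trade-off described above.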

