[ 
https://issues.apache.org/jira/browse/MAHOUT-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1368:
----------------------------------

    Comment: was deleted

(was: Ted, we need to hold off on committing this patch until we fix 
ClusterQualitySummarizer, which breaks once the patch is applied.  I'll 
look at it tomorrow; it's too late at night now to wrap my head around it.

Running ClusterQualitySummarizer (after applying this patch) on the output of 
StreamingKMeans (using the Reuters dataset) throws the following exception:

{Code}
Average distance in cluster 0 [4]: 18723.469424
Average distance in cluster 1 [1169]: 13974.466645
Average distance in cluster 2 [1932]: 1273.335898
Exception in thread "main" java.lang.IllegalArgumentException
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:76)
        at org.apache.mahout.math.stats.TDigest.quantile(TDigest.java:268)
        at org.apache.mahout.math.stats.OnlineSummarizer.getQuartile(OnlineSummarizer.java:83)
        at org.apache.mahout.math.stats.OnlineSummarizer.getMax(OnlineSummarizer.java:79)
        at org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer.printSummaries(ClusterQualitySummarizer.java:74)
        at org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer.printSummaries(ClusterQualitySummarizer.java:66)
        at org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer.run(ClusterQualitySummarizer.java:141)
        at org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer.main(ClusterQualitySummarizer.java:281)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

{Code})

> Convert OnlineSummarizer to use the new TDigest
> -----------------------------------------------
>
>                 Key: MAHOUT-1368
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1368
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Ted Dunning
>             Fix For: 0.9
>
>         Attachments: MAHOUT-1368.patch
>
>
> The new TDigest provides better accuracy for quartile estimation as well as 
> producing any other quantile you might like.  The current quartile estimation 
> of the OnlineSummarizer fails for highly skewed distributions and can't 
> really be extended to provide other quantiles.  The TDigest handles all of 
> this.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
