[
https://issues.apache.org/jira/browse/MAHOUT-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828500#comment-13828500
]
Ted Dunning commented on MAHOUT-1361:
-------------------------------------
Otis,
I should put merging into the paper. Omitting it was an oversight.
Yes. It is quite possible. The basic idea is that you concatenate the
sub-digests and compact the result exactly as you do in the sequential points
case where the number of centroids grows too large. The compaction is done by
taking centroids in random order and adding them to a new digest. The slightly
sophistication is that the centroid being added can be added in pieces to
several other centroids if they are the closest ones. This cleverness will
probably have nearly negligible effect, but it will guarantee the invariant
more precisely than if you create a new centroid from the centroid if it has
too much weight to fit on any of the equally near existing centroids.
> Online algorithm for computing accurate Quantiles using 1-D clustering
> ----------------------------------------------------------------------
>
> Key: MAHOUT-1361
> URL: https://issues.apache.org/jira/browse/MAHOUT-1361
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Affects Versions: 0.9
> Reporter: Suneel Marthi
> Assignee: Suneel Marthi
> Fix For: 0.9
>
> Attachments: MAHOUT-1361.patch
>
>
> Implementation of Ted Dunning's paper and initial work on this subject. See
> https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf
> for the paper.
> An on-line algorithm for computing approximations of rank-based statistics
> that allows controllable accuracy. This algorithm can also be used to compute
> hybrid statistics such as trimmed means in addition to computing arbitrary
> quantiles.
--
This message was sent by Atlassian JIRA
(v6.1#6144)