[
https://issues.apache.org/jira/browse/MAHOUT-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828498#comment-13828498
]
Ted Dunning commented on MAHOUT-1361:
-------------------------------------
DL,
I can't comment specifically (which is part of why I implemented this). My
impression is that Q-digests dominate count min sketches in terms of memory /
accuracy trade-off, but I am not at all sure. I am absolutely certain that
t-digests are better than Q-digests in terms of memory / accuracy for extreme
quantiles since the errors will be measured in a few ppm instead of a few
percent. If you used a flat centroid size limit to compare against Q-digests,
the memory consumption should be quite similar. In any case, t-digests are
much simpler conceptually and do not have the difficulty that Q-digests have
that the proof of accuracy is based on the batch mode rather than the on-line
case.
> Online algorithm for computing accurate Quantiles using 1-D clustering
> ----------------------------------------------------------------------
>
> Key: MAHOUT-1361
> URL: https://issues.apache.org/jira/browse/MAHOUT-1361
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Affects Versions: 0.9
> Reporter: Suneel Marthi
> Assignee: Suneel Marthi
> Fix For: 0.9
>
> Attachments: MAHOUT-1361.patch
>
>
> Implementation of Ted Dunning's paper and initial work on this subject. See
> https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf
> for the paper.
> An on-line algorithm for computing approximations of rank-based statistics
> that allows controllable accuracy. This algorithm can also be used to compute
> hybrid statistics such as trimmed means in addition to computing arbitrary
> quantiles.
--
This message was sent by Atlassian JIRA
(v6.1#6144)