[
https://issues.apache.org/jira/browse/MATH-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944470#comment-13944470
]
Ted Dunning commented on MATH-418:
----------------------------------
The t-digest is also available.
You can read more at
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf?raw=true
This paper contains accuracy comparisons against the DataFu StreamingQuantile
implementation of GK01 and against the Q-digest algorithm. T-digest dominates
both in terms of space and accuracy. I haven't tested for speed yet since the
t-digest implementation is improving rapidly. T-digest is also a very simple
algorithm conceptually which helps with accurate implementation. T-digest is
particularly good at maintaining accuracy for extreme quantiles and for highly
skewed distribution. GK01 does well on skewed distributions, but maintains
constant absolute accuracy rather than relative accuracy. Q-digest depends on
integer representations and thus fails badly on sufficiently skewed
distributions.
The library involved is available via maven and is apache licensed. Apache
Commons Math has a "no dependency" policy which might mean that sucking in the
code would be a better option than simply linking to this.
The standard of the art before t-digest is generally considered to be either
Greenwald and Khanna's algorithm GK01 (what DataFu uses) or the Q-digest (what
stream-lib used before they adopted t-digest). References are in the paper
above.
In case it isn't obvious, source code is available on github at
https://github.com/tdunning/t-digest
The p^2 algorithm is actually quite old and is far from the state of the art.
> add a storeless version of Percentile
> -------------------------------------
>
> Key: MATH-418
> URL: https://issues.apache.org/jira/browse/MATH-418
> Project: Commons Math
> Issue Type: New Feature
> Affects Versions: 2.1
> Reporter: Luc Maisonobe
> Fix For: 4.0
>
>
> The Percentile class can handle only in-memory data.
> It would be interesting to use an on-line algorithm to estimate quantiles as
> a storeless statistic.
> An example of such an algorithm is the exponentially weighted stochastic
> approximation described in a 2000 paper by Fei Chen , Diane Lambert and
> José C. Pinheiro "Incremental Quantile Estimation for Massive Tracking" which
> can be retrieved from CiteSeerX at
> [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580].
--
This message was sent by Atlassian JIRA
(v6.2#6252)