[ 
https://issues.apache.org/jira/browse/MATH-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944470#comment-13944470
 ] 

Ted Dunning commented on MATH-418:
----------------------------------

The t-digest is also available.  

You can read more at

https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf?raw=true

This paper contains accuracy comparisons against the DataFu StreamingQuantile 
implementation of GK01 and against the Q-digest algorithm.  T-digest dominates 
both in terms of space and accuracy.  I haven't tested for speed yet since the 
t-digest implementation is improving rapidly.  T-digest is also a very simple 
algorithm conceptually which helps with accurate implementation.  T-digest is 
particularly good at maintaining accuracy for extreme quantiles and for highly 
skewed distribution.  GK01 does well on skewed distributions, but maintains 
constant absolute accuracy rather than relative accuracy.  Q-digest depends on 
integer representations and thus fails badly on sufficiently skewed 
distributions.  

The library involved is available via maven and is apache licensed.  Apache 
Commons Math has a "no dependency" policy which might mean that sucking in the 
code would be a better option than simply linking to this.

The standard of the art before t-digest is generally considered to be either 
Greenwald and Khanna's algorithm GK01 (what DataFu uses) or the Q-digest (what 
stream-lib used before they adopted t-digest).  References are in the paper 
above.

In case it isn't obvious, source code is available on github at

https://github.com/tdunning/t-digest

The p^2 algorithm is actually quite old and is far from the state of the art.

> add a storeless version of Percentile
> -------------------------------------
>
>                 Key: MATH-418
>                 URL: https://issues.apache.org/jira/browse/MATH-418
>             Project: Commons Math
>          Issue Type: New Feature
>    Affects Versions: 2.1
>            Reporter: Luc Maisonobe
>             Fix For: 4.0
>
>
> The Percentile class can handle only in-memory data.
> It would be interesting to use an on-line algorithm to estimate quantiles as 
> a storeless statistic.
> An example of such an algorithm is the exponentially weighted stochastic 
> approximation  described in a 2000 paper by Fei Chen ,  Diane Lambert  and 
> José C. Pinheiro "Incremental Quantile Estimation for Massive Tracking" which 
> can be retrieved from CiteSeerX at 
> [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to