[ 
https://issues.apache.org/jira/browse/MATH-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974950#comment-13974950
 ] 

Ted Dunning commented on MATH-418:
----------------------------------

I just read the proposed code.

The tests are very good when it comes to checking for boundary conditions and 
illegal input, but they do not test several functional aspects.

In particular, it would be good to have tests for the following conditions:

a) extreme quantiles such as 99.999%-ile or 0.00001 %-ile

b) highly skewed input distributions such as Gamma(0.1, 0.1), especially with 
low quantiles like 0.1%-ile

This algorithm is essentially identical to the one that Mahout used to use.  We 
dropped it in favor of t-digest because it was only able to compute one 
percentile per set of markers and because accuracy for skewed distributions was 
relatively marginal.

The t-digest and Greenwald Khanna algorithms (let's call it GK01) both handle 
skewed distributions much more accurately than the Chen, Lambert, Pinheiro 
(let's call it CLP) algorithm.  A particular virtue of GK01 and t-digest is 
that they allow any quantile to be computed without having to pre-specify which 
quantiles you want.  It is common to need many quantiles from the same set of 
data and some applications require arbitrary quantiles.  GK01 and t-digest also 
allow inverse CDF calculation (where you get to find out the quantile value for 
a particular x).  GK01 is the algorithm used by Sawzall and (I think) Dremel.  
The t-digest algorithm has been adopted by Mahout, Streamlib and Elastic Search.

The particular virtue of t-digest over GK01 is that it is several orders of 
magnitude more accurate for extreme quantiles.  There are three well-tested 
implementations of t-digest available in both source and as a maven artifact.  
All the code is ASL 2 licensed.

See https://github.com/tdunning/t-digest for more information.


> add a storeless version of Percentile
> -------------------------------------
>
>                 Key: MATH-418
>                 URL: https://issues.apache.org/jira/browse/MATH-418
>             Project: Commons Math
>          Issue Type: New Feature
>    Affects Versions: 2.1
>            Reporter: Luc Maisonobe
>             Fix For: 4.0
>
>         Attachments: patch
>
>
> The Percentile class can handle only in-memory data.
> It would be interesting to use an on-line algorithm to estimate quantiles as 
> a storeless statistic.
> An example of such an algorithm is the exponentially weighted stochastic 
> approximation  described in a 2000 paper by Fei Chen ,  Diane Lambert  and 
> José C. Pinheiro "Incremental Quantile Estimation for Massive Tracking" which 
> can be retrieved from CiteSeerX at 
> [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to