[ 
https://issues.apache.org/jira/browse/MATH-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975026#comment-13975026
 ] 

Ted Dunning commented on MATH-418:
----------------------------------

Murthy,

I think that you didn't get the point.

All of these algorithms (except those retaining all of the dat) are 
approximate.  That isn't news, nor does it differentiate the algorithms.

The questions are:

a) is only a single quantile computed and is that quantile flexible?

b) how accurate is the quantile calculation?

c) how does the accuracy vary as a function of quantile, especially near 0 or 1?

d) how long does it take to insert a new point?

My assertion is that t-digest dominates all of these other options on all of 
these questions except possibly for (d) where it roughly matches them.  I am 
not sure I see the point of implementing these other algorithms if they provide 
no advantage.

To be specific:

GK - is relatively fast and allows computation of multiple quantiles.  Memory 
usage for a large number of quantiles is large, accuracy is less than t-digest 
controlling for memory usage.  Accuracy at quantiles near 0 or 1 is several 
orders of magnitude worse.  GK is not particularly susceptible to skewed 
distributions.

Q-digest - only works on an integer domain which makes it not work well at all 
for skewed distributions.  It has comparable accuracy as GK when quantization 
is not a factor.  Memory usage is quite comparable to t-digest, but accuracy is 
noticeably worse and near q=0 or 1, accuracy is several orders of magnitude 
worse.

BinMedian and BinApprox - only computes the median, fails entirely for skewed 
distributions where the median is outside the range mean - sd to mean + sd.  
Accuracy is comparable to other algorithms but since no quantiles other than 
median are computed it doesn't even compete.


> add a storeless version of Percentile
> -------------------------------------
>
>                 Key: MATH-418
>                 URL: https://issues.apache.org/jira/browse/MATH-418
>             Project: Commons Math
>          Issue Type: New Feature
>    Affects Versions: 2.1
>            Reporter: Luc Maisonobe
>             Fix For: 4.0
>
>         Attachments: patch
>
>
> The Percentile class can handle only in-memory data.
> It would be interesting to use an on-line algorithm to estimate quantiles as 
> a storeless statistic.
> An example of such an algorithm is the exponentially weighted stochastic 
> approximation  described in a 2000 paper by Fei Chen ,  Diane Lambert  and 
> José C. Pinheiro "Incremental Quantile Estimation for Massive Tracking" which 
> can be retrieved from CiteSeerX at 
> [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to