[
https://issues.apache.org/jira/browse/MATH-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975026#comment-13975026
]
Ted Dunning commented on MATH-418:
----------------------------------
Murthy,
I think that you didn't get the point.
All of these algorithms (except those retaining all of the dat) are
approximate. That isn't news, nor does it differentiate the algorithms.
The questions are:
a) is only a single quantile computed and is that quantile flexible?
b) how accurate is the quantile calculation?
c) how does the accuracy vary as a function of quantile, especially near 0 or 1?
d) how long does it take to insert a new point?
My assertion is that t-digest dominates all of these other options on all of
these questions except possibly for (d) where it roughly matches them. I am
not sure I see the point of implementing these other algorithms if they provide
no advantage.
To be specific:
GK - is relatively fast and allows computation of multiple quantiles. Memory
usage for a large number of quantiles is large, accuracy is less than t-digest
controlling for memory usage. Accuracy at quantiles near 0 or 1 is several
orders of magnitude worse. GK is not particularly susceptible to skewed
distributions.
Q-digest - only works on an integer domain which makes it not work well at all
for skewed distributions. It has comparable accuracy as GK when quantization
is not a factor. Memory usage is quite comparable to t-digest, but accuracy is
noticeably worse and near q=0 or 1, accuracy is several orders of magnitude
worse.
BinMedian and BinApprox - only computes the median, fails entirely for skewed
distributions where the median is outside the range mean - sd to mean + sd.
Accuracy is comparable to other algorithms but since no quantiles other than
median are computed it doesn't even compete.
> add a storeless version of Percentile
> -------------------------------------
>
> Key: MATH-418
> URL: https://issues.apache.org/jira/browse/MATH-418
> Project: Commons Math
> Issue Type: New Feature
> Affects Versions: 2.1
> Reporter: Luc Maisonobe
> Fix For: 4.0
>
> Attachments: patch
>
>
> The Percentile class can handle only in-memory data.
> It would be interesting to use an on-line algorithm to estimate quantiles as
> a storeless statistic.
> An example of such an algorithm is the exponentially weighted stochastic
> approximation described in a 2000 paper by Fei Chen , Diane Lambert and
> José C. Pinheiro "Incremental Quantile Estimation for Massive Tracking" which
> can be retrieved from CiteSeerX at
> [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.1580].
--
This message was sent by Atlassian JIRA
(v6.2#6252)