Re: [math] Proposal for New way of Computing an approximate Percentile without storing input data

Ted Dunning Sat, 22 Mar 2014 15:32:20 -0700

Murthy,

I recently developed an alternative algorithm which provides superior
accuracy for extreme quantiles.  You can read more at


https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf?raw=true

The library involved is available via maven and is apache licensed.  Apache
Commons Math has a "no dependency" policy which might mean that sucking in
the code would be a better option than simply linking to this.

The standard of the art before t-digest is generally considered to be
either Greenwald and Khanna's algorithm GK01 or the Q-digest.  References
are in the paper above.

In case it isn't obvious, source code is available on github at

https://github.com/tdunning/t-digest

The p^2 algorithm that you suggest is actually quite old and far from the
state of the art.




On Sat, Mar 22, 2014 at 2:40 PM, Phil Steitz <[email protected]> wrote:

> On 3/22/14, 2:11 PM, venkatesha m wrote:
> > Hi,
> >
> > I would like to propose for adding new way of computing the percentile
> without needing to store most of input data.  Since this is my first time
> on contributing to apache; please help me / correct me if i miss any
> procedure here.
> >
> > Here are the details.
> >
> > Description:
> > The Percentile calculation in a  traditional way require all the data
> points to be stored and sorted before accesiing the pth Percentile value of
> the data set. However the storage of points can become prohibitive when we
> need to make use of the existing Percentile Implementation at big data
> scale(For eg: when computing the daily or weekly percentile value of a
> certain performance metric where the data points accumulated over day and
> week may run to GB and TB).  While platforms such as hadoop exist to solve
> the data scale issue; the need for a statistical computation of quantiles
> without storing data is an absolute essential.
> > While looking in commons-math classes though Percentile class is
> available it is implemented with storage of input as requirement. So was
> wondering if we could add a class to calculate Percentile without needing
> to store data.  The algorithm that i have chosen to implement and propose
> is based on P Square algorithm (
> http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf ) which requires a
> minimal and finite set of memory stores to compute percentiles for
> continuous stream of data.
> >
> > Ref:
> > http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf which has succing
> representation of the workflow of the algorithm
> >
> > Advantages:
> > a) As is claimed in the orignal workd the accuracy improves over
> moderate to large data sets which is the need.
> > b) A minimal and constant sized data store used to compute a large data
> set
> > c) Useful in Hadoop Map reduce applications
> >
> > Implementation:
> > I have implemented this algorithm based on StorelessUnivariateStatistic
> after checking out from 3.2 branch.  I have also opened a JIRA ticket on
> the same (https://issues.apache.org/jira/browse/MATH-1112 ) for
> requesting a new feature to be added.
> >
> > Please let me know when and how i could send my code for review.
>
> Thanks, Murthy and welcome!
>
> I am personally fine with the P-Square algorithm and would welcome a
> patch including implementation.  Unless others disagree with this
> approach (give it a day or two), I would go ahead and attach a patch
> with the implementation to the JIRA you opened.
>
> Thanks in advance for your contributions!
>
> Phil
> >
> > thanks
> > murthy
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: [math] Proposal for New way of Computing an approximate Percentile without storing input data

Reply via email to