Murthy, I recently developed an alternative algorithm which provides superior accuracy for extreme quantiles. You can read more at
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf?raw=true The library involved is available via maven and is apache licensed. Apache Commons Math has a "no dependency" policy which might mean that sucking in the code would be a better option than simply linking to this. The standard of the art before t-digest is generally considered to be either Greenwald and Khanna's algorithm GK01 or the Q-digest. References are in the paper above. In case it isn't obvious, source code is available on github at https://github.com/tdunning/t-digest The p^2 algorithm that you suggest is actually quite old and far from the state of the art. On Sat, Mar 22, 2014 at 2:40 PM, Phil Steitz <[email protected]> wrote: > On 3/22/14, 2:11 PM, venkatesha m wrote: > > Hi, > > > > I would like to propose for adding new way of computing the percentile > without needing to store most of input data. Since this is my first time > on contributing to apache; please help me / correct me if i miss any > procedure here. > > > > Here are the details. > > > > Description: > > The Percentile calculation in a traditional way require all the data > points to be stored and sorted before accesiing the pth Percentile value of > the data set. However the storage of points can become prohibitive when we > need to make use of the existing Percentile Implementation at big data > scale(For eg: when computing the daily or weekly percentile value of a > certain performance metric where the data points accumulated over day and > week may run to GB and TB). While platforms such as hadoop exist to solve > the data scale issue; the need for a statistical computation of quantiles > without storing data is an absolute essential. > > While looking in commons-math classes though Percentile class is > available it is implemented with storage of input as requirement. So was > wondering if we could add a class to calculate Percentile without needing > to store data. The algorithm that i have chosen to implement and propose > is based on P Square algorithm ( > http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf ) which requires a > minimal and finite set of memory stores to compute percentiles for > continuous stream of data. > > > > Ref: > > http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf which has succing > representation of the workflow of the algorithm > > > > Advantages: > > a) As is claimed in the orignal workd the accuracy improves over > moderate to large data sets which is the need. > > b) A minimal and constant sized data store used to compute a large data > set > > c) Useful in Hadoop Map reduce applications > > > > Implementation: > > I have implemented this algorithm based on StorelessUnivariateStatistic > after checking out from 3.2 branch. I have also opened a JIRA ticket on > the same (https://issues.apache.org/jira/browse/MATH-1112 ) for > requesting a new feature to be added. > > > > Please let me know when and how i could send my code for review. > > Thanks, Murthy and welcome! > > I am personally fine with the P-Square algorithm and would welcome a > patch including implementation. Unless others disagree with this > approach (give it a day or two), I would go ahead and attach a patch > with the implementation to the JIRA you opened. > > Thanks in advance for your contributions! > > Phil > > > > thanks > > murthy > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > >
