On 3/22/14, 2:11 PM, venkatesha m wrote: > Hi, > > I would like to propose for adding new way of computing the percentile > without needing to store most of input data. Since this is my first time on > contributing to apache; please help me / correct me if i miss any procedure > here. > > Here are the details. > > Description: > The Percentile calculation in a traditional way require all the data points > to be stored and sorted before accesiing the pth Percentile value of the data > set. However the storage of points can become prohibitive when we need to > make use of the existing Percentile Implementation at big data scale(For eg: > when computing the daily or weekly percentile value of a certain performance > metric where the data points accumulated over day and week may run to GB and > TB). While platforms such as hadoop exist to solve the data scale issue; the > need for a statistical computation of quantiles without storing data is an > absolute essential. > While looking in commons-math classes though Percentile class is available it > is implemented with storage of input as requirement. So was wondering if we > could add a class to calculate Percentile without needing to store data. The > algorithm that i have chosen to implement and propose is based on P Square > algorithm ( http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf ) which > requires a minimal and finite set of memory stores to compute percentiles for > continuous stream of data. > > Ref: > http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf which has succing > representation of the workflow of the algorithm > > Advantages: > a) As is claimed in the orignal workd the accuracy improves over moderate to > large data sets which is the need. > b) A minimal and constant sized data store used to compute a large data set > c) Useful in Hadoop Map reduce applications > > Implementation: > I have implemented this algorithm based on StorelessUnivariateStatistic after > checking out from 3.2 branch. I have also opened a JIRA ticket on the same > (https://issues.apache.org/jira/browse/MATH-1112 ) for requesting a new > feature to be added. > > Please let me know when and how i could send my code for review.
Thanks, Murthy and welcome! I am personally fine with the P-Square algorithm and would welcome a patch including implementation. Unless others disagree with this approach (give it a day or two), I would go ahead and attach a patch with the implementation to the JIRA you opened. Thanks in advance for your contributions! Phil > > thanks > murthy --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org