Hi, I would like to propose adding a new way of computing percentiles without needing to store most of the input data. Since this is my first time contributing to Apache, please help me or correct me if I miss any procedure here.
Here are the details.

Description: Calculating a percentile in the traditional way requires all the data points to be stored and sorted before the pth percentile value of the data set can be accessed. However, storing the points can become prohibitive when the existing Percentile implementation is used at big data scale (for example, when computing the daily or weekly percentile value of a certain performance metric, where the data points accumulated over a day or a week may run to GB or TB). While platforms such as Hadoop exist to solve the data scale issue, computing quantiles without storing the data is still essential. Looking through the commons-math classes, a Percentile class is available, but it is implemented with storage of the input as a requirement. So I was wondering if we could add a class to calculate percentiles without needing to store the data.

The algorithm that I have chosen to implement and propose is based on the P Square algorithm ( http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf ), which requires only a small, fixed amount of memory to compute percentiles for a continuous stream of data. The paper also has a succinct representation of the workflow of the algorithm.

Advantages:
a) As claimed in the original work, the accuracy improves on moderate to large data sets, which is exactly the case needed here.
b) A minimal, constant-sized data store is used to process a large data set.
c) Useful in Hadoop MapReduce applications.

Implementation: I have implemented this algorithm based on StorelessUnivariateStatistic after checking out from the 3.2 branch. I have also opened a JIRA ticket ( https://issues.apache.org/jira/browse/MATH-1112 ) requesting that this new feature be added.

Please let me know when and how I could send my code for review.

thanks
murthy
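
P.S. For anyone who has not read the paper, here is a minimal sketch of the P Square idea in Java. It is only an illustration under stated assumptions and is not the code from MATH-1112: the class name PSquareSketch is made up, the increment/getResult method names are borrowed from StorelessUnivariateStatistic, and the nearest-rank fallback for fewer than five observations is an assumption of this sketch. As described in the paper, the estimator keeps five markers (heights and positions) and nudges the three middle ones with a parabolic, or failing that linear, interpolation after each observation.

```java
import java.util.Arrays;

/** Illustrative P Square estimator; hypothetical class, not the MATH-1112 code. */
final class PSquareSketch {
    private final double p;                    // target quantile in (0, 1), e.g. 0.5 for the median
    private final double[] q = new double[5];  // marker heights (estimated quantile values)
    private final int[] n = {1, 2, 3, 4, 5};   // actual marker positions (1-based ranks)
    private final double[] np;                 // desired marker positions
    private final double[] dn;                 // growth of desired positions per observation
    private long count = 0;                    // observations seen so far

    PSquareSketch(double p) {
        if (!(p > 0 && p < 1)) throw new IllegalArgumentException("p must be in (0, 1)");
        this.p = p;
        this.np = new double[] {1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5};
        this.dn = new double[] {0, p / 2, p, (1 + p) / 2, 1};
    }

    /** Feeds one observation; only the five markers are updated, x itself is not kept. */
    void increment(double x) {
        if (count < 5) {                       // bootstrap: the first five values become the markers
            q[(int) count++] = x;
            if (count == 5) Arrays.sort(q);
            return;
        }
        count++;
        int k;                                 // index of the cell [q[k], q[k+1]) that contains x
        if (x < q[0]) {
            q[0] = x;                          // new minimum
            k = 0;
        } else if (x >= q[4]) {
            q[4] = x;                          // new maximum
            k = 3;
        } else {
            k = 0;
            while (x >= q[k + 1]) k++;
        }
        for (int i = k + 1; i < 5; i++) n[i]++;      // markers above the cell shift right by one
        for (int i = 0; i < 5; i++) np[i] += dn[i];  // desired positions advance

        // Move each middle marker by one position if it is off by at least one
        // and there is room between it and its neighbour.
        for (int i = 1; i <= 3; i++) {
            double d = np[i] - n[i];
            boolean right = d >= 1 && n[i + 1] - n[i] > 1;
            boolean left = d <= -1 && n[i - 1] - n[i] < -1;
            if (right || left) {
                int s = right ? 1 : -1;
                double candidate = parabolic(i, s);
                // Fall back to linear interpolation if the parabola breaks marker ordering.
                q[i] = (q[i - 1] < candidate && candidate < q[i + 1]) ? candidate : linear(i, s);
                n[i] += s;
            }
        }
    }

    /** Piecewise-parabolic height prediction for moving marker i by s (+1 or -1). */
    private double parabolic(int i, int s) {
        double a = (double) (n[i] - n[i - 1] + s) * (q[i + 1] - q[i]) / (n[i + 1] - n[i]);
        double b = (double) (n[i + 1] - n[i] - s) * (q[i] - q[i - 1]) / (n[i] - n[i - 1]);
        return q[i] + (double) s / (n[i + 1] - n[i - 1]) * (a + b);
    }

    /** Linear height prediction towards the neighbour in direction s. */
    private double linear(int i, int s) {
        return q[i] + s * (q[i + s] - q[i]) / (n[i + s] - n[i]);
    }

    /** Current estimate of the p-th quantile (the height of the middle marker). */
    double getResult() {
        if (count == 0) return Double.NaN;
        if (count < 5) {                       // assumption: nearest-rank on the few values seen
            double[] seen = Arrays.copyOf(q, (int) count);
            Arrays.sort(seen);
            return seen[Math.max(0, (int) Math.ceil(p * count) - 1)];
        }
        return q[2];
    }
}
```

Usage would be along the lines of creating new PSquareSketch(0.5), calling increment(x) for each value in the stream, and reading getResult() at any time. The memory footprint stays at the five markers no matter how many values are fed in, which is the property that matters at the data sizes described above.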