Hi,

I would like to propose for adding new way of computing the percentile without 
needing to store most of input data.  Since this is my first time on 
contributing to apache; please help me / correct me if i miss any procedure 
here.

Here are the details.

Description: 
The Percentile calculation in a  traditional way require all the data points to 
be stored and sorted before accesiing the pth Percentile value of the data set. 
However the storage of points can become prohibitive when we need to make use 
of the existing Percentile Implementation at big data scale(For eg: when 
computing the daily or weekly percentile value of a certain performance metric 
where the data points accumulated over day and week may run to GB and TB).  
While platforms such as hadoop exist to solve the data scale issue; the need 
for a statistical computation of quantiles without storing data is an absolute 
essential.
While looking in commons-math classes though Percentile class is available it 
is implemented with storage of input as requirement. So was wondering if we 
could add a class to calculate Percentile without needing to store data.  The 
algorithm that i have chosen to implement and propose is based on P Square 
algorithm ( http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf ) which requires 
a minimal and finite set of memory stores to compute percentiles for continuous 
stream of data.

Ref: 
http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf which has succing 
representation of the workflow of the algorithm

Advantages: 
a) As is claimed in the orignal workd the accuracy improves over moderate to 
large data sets which is the need.
b) A minimal and constant sized data store used to compute a large data set 
c) Useful in Hadoop Map reduce applications

Implementation:
I have implemented this algorithm based on StorelessUnivariateStatistic after 
checking out from 3.2 branch.  I have also opened a JIRA ticket on the same 
(https://issues.apache.org/jira/browse/MATH-1112 ) for requesting a new feature 
to be added. 

Please let me know when and how i could send my code for review.

thanks
murthy

Reply via email to