Re: [math] Proposal for New way of Computing an approximate Percentile without storing input data

Phil Steitz Sat, 22 Mar 2014 14:41:28 -0700

On 3/22/14, 2:11 PM, venkatesha m wrote:
> Hi,
>
> I would like to propose for adding new way of computing the percentile 
> without needing to store most of input data.  Since this is my first time on 
> contributing to apache; please help me / correct me if i miss any procedure 
> here.
>
> Here are the details.
>
> Description: 
> The Percentile calculation in a  traditional way require all the data points 
> to be stored and sorted before accesiing the pth Percentile value of the data 
> set. However the storage of points can become prohibitive when we need to 
> make use of the existing Percentile Implementation at big data scale(For eg: 
> when computing the daily or weekly percentile value of a certain performance 
> metric where the data points accumulated over day and week may run to GB and 
> TB).  While platforms such as hadoop exist to solve the data scale issue; the 
> need for a statistical computation of quantiles without storing data is an 
> absolute essential.
> While looking in commons-math classes though Percentile class is available it 
> is implemented with storage of input as requirement. So was wondering if we 
> could add a class to calculate Percentile without needing to store data.  The 
> algorithm that i have chosen to implement and propose is based on P Square 
> algorithm ( http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf ) which 
> requires a minimal and finite set of memory stores to compute percentiles for 
> continuous stream of data.
>
> Ref: 
> http://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf which has succing 
> representation of the workflow of the algorithm
>
> Advantages: 
> a) As is claimed in the orignal workd the accuracy improves over moderate to 
> large data sets which is the need.
> b) A minimal and constant sized data store used to compute a large data set 
> c) Useful in Hadoop Map reduce applications
>
> Implementation:
> I have implemented this algorithm based on StorelessUnivariateStatistic after 
> checking out from 3.2 branch.  I have also opened a JIRA ticket on the same 
> (https://issues.apache.org/jira/browse/MATH-1112 ) for requesting a new 
> feature to be added. 
>
> Please let me know when and how i could send my code for review.


Thanks, Murthy and welcome!

I am personally fine with the P-Square algorithm and would welcome a
patch including implementation.  Unless others disagree with this
approach (give it a day or two), I would go ahead and attach a patch
with the implementation to the JIRA you opened.

Thanks in advance for your contributions!

Phil
>
> thanks
> murthy


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Re: [math] Proposal for New way of Computing an approximate Percentile without storing input data

Reply via email to