[ 
https://issues.apache.org/jira/browse/PIG-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637585#comment-13637585
 ] 

Vicki Fu commented on PIG-3221:
-------------------------------

Hi, I would like to work on this ticket for GSoC 2013.
I have already gone through Pig's sampling code. Currently Pig has two 
samplers: RandomSampleLoader and PoissonSampleLoader. RandomSampleLoader is 
the basic sampling method: it allocates a buffer for numSamples tuples, scans 
the input, and inserts tuples into the buffer at random positions. 
PoissonSampleLoader uses the Poisson cumulative distribution function to 
predict the probability that a partition has at most k samples.
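
To make the two ideas above concrete, here is a minimal Python sketch. It is 
not Pig's code: reservoir_sample is the classic reservoir algorithm 
(Algorithm R), which is only my reading of the buffer-and-random-position 
description, and poisson_cdf simply evaluates P(X <= k) for a Poisson variable 
with an assumed mean lam. Both function names are hypothetical.

```python
import math
import random


def reservoir_sample(stream, num_samples, rng=None):
    """Keep a uniform random sample of num_samples items from a stream."""
    if rng is None:
        rng = random.Random()
    buffer = []
    for i, item in enumerate(stream):
        if i < num_samples:
            # Fill the buffer with the first num_samples items.
            buffer.append(item)
        else:
            # Pick a position in [0, i]; keep the item only if the
            # position falls inside the buffer.
            j = rng.randint(0, i)
            if j < num_samples:
                buffer[j] = item
    return buffer


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) computed from P(X = i - 1)
        total += term
    return total
```

For example, reservoir_sample(range(1000), 10) returns 10 distinct values 
from the input, and poisson_cdf(0, lam) equals exp(-lam).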

Bootstrapping is a method for deriving robust estimates of standard errors and 
confidence intervals for estimates such as the mean, median, proportion, odds 
ratio, correlation coefficient or regression coefficient. The sampler 
therefore needs to keep statistical information during sampling.

Algorithm for bootstrap sampling:
1. Construct an empirical probability distribution from the sample by placing 
probability 1/n on each of its n points. This is the nonparametric maximum 
likelihood estimate of the population distribution, w.
2. Draw a random sample of size n with replacement from that distribution. 
This is a ‘resample’.
3. Calculate the statistic of interest, L, on the resample.
4. Repeat steps 2 and 3 more than n times.
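
The steps above can be sketched in plain Python as follows. This is only an 
illustration under my own assumptions, not the proposed Pig implementation: 
the function names bootstrap and percentile_interval, the example data, and 
the choice of the percentile method for the confidence interval are all 
hypothetical.

```python
import random
import statistics


def bootstrap(sample, statistic, num_resamples, rng=None):
    """Return the statistic computed on each of num_resamples resamples."""
    if rng is None:
        rng = random.Random()
    n = len(sample)
    estimates = []
    for _ in range(num_resamples):
        # Step 2: draw n items with replacement; each item has
        # probability 1/n of being picked on every draw (step 1).
        resample = rng.choices(sample, k=n)
        # Step 3: compute the statistic of interest on the resample.
        estimates.append(statistic(resample))
    return estimates


def percentile_interval(estimates, confidence=0.95):
    """Percentile confidence interval read off the sorted estimates."""
    ordered = sorted(estimates)
    alpha = (1.0 - confidence) / 2.0
    lower = ordered[int(alpha * (len(ordered) - 1))]
    upper = ordered[int((1.0 - alpha) * (len(ordered) - 1))]
    return lower, upper


# Made-up example data; step 4 repeats the resampling 2000 times.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]
estimates = bootstrap(data, statistics.mean, 2000, random.Random(42))
std_error = statistics.stdev(estimates)  # bootstrap standard error of the mean
low, high = percentile_interval(estimates)
```

Here the standard error is the standard deviation of the resampled means, and 
(low, high) is an approximate 95% confidence interval for the mean.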

For bootstrap sampling, we can currently use an R or Python script directly. 
But that is not a big data solution, and it also depends on R and the related 
packages being available.

Implementation of bootstrap sampling:
1. Add parameters to support the new sampling method.
2. BootStrapSampleLoader will extend RandomSampleLoader.
3. Several statistics will be collected: standard deviation, mean, confidence 
interval, and so on.

My plan is to implement bootstrap sampling, stratified sampling and reservoir 
sampling. (I am not sure everything can be finished within the GSoC timeframe, 
but I can continue working on it after the summer.)


Thanks 
Yu Fu
PhD student at UMBC.

References
Davison, A. C., and D. V. Hinkley. 2006. Bootstrap Methods and their 
Application. Cambridge: Cambridge University Press.
Shao, J., and D. Tu. 1995. The Jackknife and Bootstrap. New York: Springer.
MAD Skills: New Analysis Practices for Big Data. 
http://db.cs.berkeley.edu/jmh/papers/madskills-032009.pdf
On the Choice of m in the m out of n Bootstrap and its Application to 
Confidence Bounds for Extreme Percentiles. 
http://www.stat.berkeley.edu/~bickel/BS2008SS.pdf
                
> Bootstrap sampling
> ------------------
>
>                 Key: PIG-3221
>                 URL: https://issues.apache.org/jira/browse/PIG-3221
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>              Labels: gsoc2013
>
> Implement a bootstrap sampling option ( 
> http://en.wikipedia.org/wiki/Bootstrap_(statistics) ) in Pig's SAMPLE 
> operator.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
