[
https://issues.apache.org/jira/browse/PIG-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637585#comment-13637585
]
Vicki Fu commented on PIG-3221:
-------------------------------
Hi, I would like to take on this ticket for GSoC 2013.
I have already gone through Pig's sampling code. Currently Pig has two sample
loaders: RandomSampleLoader and PoissonSampleLoader. RandomSampleLoader is the
basic sampling method: it allocates a buffer for numSamples tuples, scans the
input, and inserts later tuples at randomly chosen buffer positions.
PoissonSampleLoader uses the Poisson cumulative distribution function to
predict the probability that a partition has less than or equal to k samples.
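The buffer-based scheme described for RandomSampleLoader is essentially
reservoir sampling. Below is a minimal Python sketch of that idea, not Pig's
actual code: the function name reservoir_sample and its parameters are
hypothetical, and plain Python values stand in for Pig tuples.

```python
import random


def reservoir_sample(records, num_samples, rng=None):
    """Return a uniform random sample of up to num_samples records in one pass.

    Hypothetical helper for illustration; it is not part of Pig.
    """
    rng = rng or random.Random()
    buffer = []
    for i, record in enumerate(records):
        if i < num_samples:
            # Fill the buffer with the first num_samples records.
            buffer.append(record)
        else:
            # Pick a position in [0, i]; the record replaces a buffered one
            # only if that position falls inside the buffer.
            j = rng.randint(0, i)
            if j < num_samples:
                buffer[j] = record
    return buffer
```

Each record ends up in the buffer with the same probability, and the input
size does not need to be known in advance.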
Bootstrapping is a method for deriving robust estimates of standard errors and
confidence intervals for estimates such as the mean, median, proportion, odds
ratio, correlation coefficient or regression coefficient. A bootstrap sampler
therefore has to keep statistical information while it samples.
Algorithm for bootstrap sampling:
1. Construct an empirical probability distribution from the sample by assigning
probability 1/n to each of its n points. This is the nonparametric maximum
likelihood estimate of the population distribution, w.
2. Draw a random sample of size n with replacement from that distribution. This
is a ‘resample’.
3. Calculate the statistic of interest, L, on the resample.
4. Repeat steps 2 and 3 more than n times.
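The steps above can be sketched in a few lines of Python. This is only an
illustration under my own assumptions: the name bootstrap_replicates and its
parameters are hypothetical, the whole sample is assumed to fit in memory, and
the statistic L is passed in as a function.

```python
import random
import statistics


def bootstrap_replicates(sample, statistic, num_resamples, rng=None):
    """Return the statistic computed on num_resamples bootstrap resamples.

    Hypothetical helper for illustration; it is not part of Pig.
    """
    rng = rng or random.Random()
    n = len(sample)
    replicates = []
    for _ in range(num_resamples):
        # Step 2: draw n items with replacement, each with probability 1/n.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        # Step 3: compute the statistic of interest on the resample.
        replicates.append(statistic(resample))
    return replicates


data = [2.0, 3.5, 4.0, 4.5, 5.0, 7.5, 9.0, 12.0]
medians = bootstrap_replicates(data, statistics.median, 1000, random.Random(0))
```

The list of replicates (here, 1000 resampled medians) is what standard errors
and confidence intervals are later computed from.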
For bootstrap sampling, we can currently use an R or Python script directly.
But that is not a big data solution, and it depends on R and the related
packages being available.
Implementation of bootstrap sampling:
1. Add parameters to support the new sampling method.
2. BootStrapSampleLoader will extend RandomSampleLoader.
3. Several statistics will be collected: standard deviation (STDDEV), average
(AVE), confidence interval and so on.
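To make the third item concrete, here is a rough Python sketch of how those
statistics could be derived from a list of bootstrap replicates. The function
summarize_replicates and the result keys are hypothetical, and the simple
percentile interval is just one possible choice of confidence interval, not a
decided design.

```python
import statistics


def summarize_replicates(replicates, confidence=0.95):
    """Summarize bootstrap replicates: average, std. deviation, percentile CI.

    Hypothetical helper for illustration; needs at least two replicates.
    """
    ordered = sorted(replicates)
    b = len(ordered)
    alpha = 1.0 - confidence
    # Percentile interval: cut alpha/2 of the replicates from each tail.
    lo_idx = int((alpha / 2) * (b - 1))
    hi_idx = int((1 - alpha / 2) * (b - 1))
    return {
        "ave": statistics.mean(ordered),
        # The spread of the replicates estimates the standard error.
        "stddev": statistics.stdev(ordered),
        "ci": (ordered[lo_idx], ordered[hi_idx]),
    }
```

The standard deviation of the replicates is the bootstrap estimate of the
standard error of the statistic.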
My plan is to implement bootstrap sampling, stratified sampling and reservoir
sampling. (I am not sure everything can be finished in the GSoC timeframe, but
I can keep working on it after the summer.)
Thanks
Yu Fu
PhD student at UMBC.
References
Davison, A. C., and D. V. Hinkley. 2006. Bootstrap Methods and their
Application. Cambridge University Press.
Shao, J., and D. Tu. 1995. The Jackknife and Bootstrap. New York: Springer.
MAD Skills: New Analysis Practices for Big Data
http://db.cs.berkeley.edu/jmh/papers/madskills-032009.pdf
On the Choice of m in the m out of n Bootstrap and its Application to
Confidence Bounds for Extreme Percentiles
http://www.stat.berkeley.edu/~bickel/BS2008SS.pdf
> Bootstrap sampling
> ------------------
>
> Key: PIG-3221
> URL: https://issues.apache.org/jira/browse/PIG-3221
> Project: Pig
> Issue Type: New Feature
> Reporter: Gianmarco De Francisci Morales
> Labels: gsoc2013
>
> Implement a bootstrap sampling option (
> http://en.wikipedia.org/wiki/Bootstrap_(statistics) ) in Pig's SAMPLE
> operator.