[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634622#comment-14634622
 ] 

Chengxiang Li edited comment on FLINK-1901 at 7/22/15 2:05 AM:
---------------------------------------------------------------

To randomly choose a sample from a DataSet S, there are basically two kinds 
of sampling requirement: sampling with a fraction (such as "randomly choose 5% 
of the items in S") and sampling with a fixed size (such as "randomly choose 
100 items from S"). In addition, we do not know the size of S unless we pay 
the extra cost of computing it through DataSet::count().
# Sampling with fraction
#* With replacement: the number of times each item is selected approximately 
follows a [Poisson 
Distribution|https://en.wikipedia.org/wiki/Poisson_distribution] whose mean is 
the fraction, so Poisson Sampling can be used to choose the sample items.
#* Without replacement: during sampling, we can treat the decision to select 
each item from the iterator as an independent [Bernoulli 
Trial|https://en.wikipedia.org/wiki/Bernoulli_trial] whose success probability 
is the fraction.
# Sampling with fixed size
#* Use DataSet::count() to get the dataset size n; with the fixed size k, the 
fraction is k/n, which turns this into sampling with fraction (the sample then 
contains k items only in expectation).
#* [Reservoir Sampling|https://en.wikipedia.org/wiki/Reservoir_sampling] is 
another commonly used algorithm to randomly choose a sample of k items from a 
list S containing n items, where n is either very large or unknown. There are 
different reservoir sampling algorithms that support both sampling with 
replacement and sampling without replacement.
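
As a rough illustration of the three approaches above, here is a minimal Python sketch. It is not Flink code: plain Python iterables stand in for the iterator over a DataSet partition, and all function names (bernoulli_sample, poisson_sample, reservoir_sample) are hypothetical. The Poisson draw uses Knuth's multiplication method, which is an assumption chosen for brevity and only suits small means such as a sampling fraction.

```python
import math
import random


def bernoulli_sample(items, fraction, seed=None):
    """Fraction, without replacement: one Bernoulli trial per item."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


def _poisson(rng, mean):
    """Draw from Poisson(mean) using Knuth's method (small means only)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_sample(items, fraction, seed=None):
    """Fraction, with replacement: emit each item Poisson(fraction) times."""
    rng = random.Random(seed)
    sample = []
    for x in items:
        sample.extend([x] * _poisson(rng, fraction))
    return sample


def reservoir_sample(items, k, seed=None):
    """Fixed size, without replacement, one pass, size of input unknown."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(x)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = x
    return reservoir
```

None of the three functions needs the size of the input up front, which is the point of avoiding DataSet::count(). Passing the same seed reproduces the same sample, as the issue description asks for.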





> Create sample operator for Dataset
> ----------------------------------
>
>                 Key: FLINK-1901
>                 URL: https://issues.apache.org/jira/browse/FLINK-1901
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Theodore Vasiloudis
>            Assignee: Chengxiang Li
>
> In order to be able to implement Stochastic Gradient Descent and a number of 
> other machine learning algorithms we need to have a way to take a random 
> sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset, 
> choose the relative size of the sample, and set a seed for reproducibility.


