[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634622#comment-14634622 ]
Chengxiang Li edited comment on FLINK-1901 at 7/22/15 2:05 AM:
---------------------------------------------------------------

To randomly choose a sample from a DataSet S, there are basically two kinds of sampling requirements: sampling with a fraction (such as "randomly choose 5% of the items in S") and sampling with a fixed size (such as "randomly choose 100 items from S"). Besides, we do not know the size of S unless we pay the extra cost of computing it through DataSet::count().
# Sampling with fraction
#* With replacement: the number of times each item appears in the sample approximately follows a [Poisson Distribution|https://en.wikipedia.org/wiki/Poisson_distribution] in this case, so Poisson Sampling can be used to choose the sample items.
#* Without replacement: during sampling, we can treat the selection of each item in the iterator as a [Bernoulli Trial|https://en.wikipedia.org/wiki/Bernoulli_trial].
# Sampling with fixed size
#* Use DataSet::count() to get the dataset size; together with the fixed sample size, this turns the problem into sampling with fraction.
#* [Reservoir Sampling|https://en.wikipedia.org/wiki/Reservoir_sampling] is another commonly used algorithm to randomly choose a sample of k items from a list S containing n items, where n is either a very large or unknown number. There are different reservoir sampling algorithms that support both sampling with replacement and sampling without replacement.


was (Author: chengxiang li):
To randomly choose a sample from a DataSet S, there are basically two kinds of sampling requirements: sampling with a factor (such as "randomly choose 5% of the items in S") and sampling with a fixed size (such as "randomly choose 100 items from S"). Besides, we do not know the size of S unless we pay the extra cost of computing it through DataSet::count().
# Sampling with factor
#* With replacement: the number of times each item appears in the sample approximately follows a [Poisson Distribution|https://en.wikipedia.org/wiki/Poisson_distribution] in this case, so Poisson Sampling can be used to choose the sample items.
#* Without replacement: during sampling, we can treat the selection of each item in the iterator as a [Bernoulli Trial|https://en.wikipedia.org/wiki/Bernoulli_trial].
# Sampling with fixed size
#* Use DataSet::count() to get the dataset size; together with the fixed sample size, this turns the problem into sampling with factor.
#* [Reservoir Sampling|https://en.wikipedia.org/wiki/Reservoir_sampling] is another commonly used algorithm to randomly choose a sample of k items from a list S containing n items, where n is either a very large or unknown number. There are different reservoir sampling algorithms that support both sampling with replacement and sampling without replacement.


> Create sample operator for Dataset
> ----------------------------------
>
>                 Key: FLINK-1901
>                 URL: https://issues.apache.org/jira/browse/FLINK-1901
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Theodore Vasiloudis
>            Assignee: Chengxiang Li
>
> In order to be able to implement Stochastic Gradient Descent and a number of
> other machine learning algorithms we need to have a way to take a random
> sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset,
> choose the relative size of the sample, and set a seed for reproducibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
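
For illustration, the three single-pass approaches listed in the comment above (Bernoulli trials, Poisson sampling, and reservoir sampling) could be sketched as follows. This is a minimal sketch in plain Python over an ordinary iterable, not Flink code: the function names are hypothetical, the Poisson draw uses Knuth's multiplication method as an assumed stand-in for whatever generator an implementation would pick, and the reservoir variant shown is the basic "Algorithm R" (without replacement) from the linked Wikipedia article. Each function takes a seeded random generator, which is one way to meet the reproducibility requirement from the issue description.

```python
import math
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def bernoulli_sample(items: Iterable[T], fraction: float,
                     rng: random.Random) -> Iterator[T]:
    """Fraction-based sampling without replacement: one Bernoulli trial per item."""
    for item in items:
        if rng.random() < fraction:
            yield item


def _poisson(mean: float, rng: random.Random) -> int:
    """Draw from Poisson(mean) using Knuth's method (adequate for small means)."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_sample(items: Iterable[T], fraction: float,
                   rng: random.Random) -> Iterator[T]:
    """Fraction-based sampling with replacement: emit each item Poisson(fraction) times."""
    for item in items:
        for _ in range(_poisson(fraction, rng)):
            yield item


def reservoir_sample(items: Iterable[T], k: int,
                     rng: random.Random) -> List[T]:
    """Fixed-size sampling without replacement in one pass (Algorithm R).

    The input size does not need to be known in advance.
    """
    reservoir: List[T] = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)  # both bounds inclusive
            if slot < k:
                reservoir[slot] = item
    return reservoir


# Usage with a fixed seed, so repeated runs give the same samples.
data = range(1000)
about_five_percent = list(bernoulli_sample(data, 0.05, random.Random(42)))
with_repeats = list(poisson_sample(data, 0.05, random.Random(42)))
exactly_hundred = reservoir_sample(data, 100, random.Random(42))
```

None of the three functions needs the size of the input up front, which matches the constraint noted in the comment that the size of S is unknown unless DataSet::count() is run first. In a distributed setting the per-partition results would still have to be combined, which this sketch does not attempt.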