[
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695166#comment-14695166
]
ASF GitHub Bot commented on FLINK-1901:
---------------------------------------
Github user thvasilo commented on a diff in the pull request:
https://github.com/apache/flink/pull/949#discussion_r36969808
--- Diff: flink-java/src/main/java/org/apache/flink/api/java/DataSet.java ---
@@ -1057,7 +1061,68 @@ public long count() throws Exception {
     public UnionOperator<T> union(DataSet<T> other){
         return new UnionOperator<T>(this, other, Utils.getCallLocationName());
     }
+
+    // --------------------------------------------------------------------------------------------
+    //  Sample
+    // --------------------------------------------------------------------------------------------
+
+    /**
+     * Generate a sample of DataSet by the probability fraction of each element.
+     *
+     * @param withReplacement Whether element can be selected more than once.
+     * @param fraction        Probability that each element is chosen, should be [0,1] without replacement,
+     *                        and [0, ∞) with replacement. While fraction is larger than 1, the elements are
+     *                        expected to be selected multi times into sample on average.
+     * @return The sampled DataSet
+     */
+    public MapPartitionOperator<T, T> sample(final boolean withReplacement, final double fraction) {
+        return sample(withReplacement, fraction, Utils.RNG.nextLong());
+    }
+
+    /**
+     * Generate a sample of DataSet by the probability fraction of each element.
+     *
+     * @param withReplacement Whether element can be selected more than once.
+     * @param fraction        Probability that each element is chosen, should be [0,1] without replacement,
+     *                        and [0, ∞) with replacement. While fraction is larger than 1, the elements are
+     *                        expected to be selected multi times into sample on average.
+     * @param seed            random number generator seed.
+     * @return The sampled DataSet
+     */
+    public MapPartitionOperator<T, T> sample(final boolean withReplacement, final double fraction, final long seed) {
+        return mapPartition(new SampleWithFraction<T>(withReplacement, fraction, seed));
+    }
+
+    /**
+     * Generate a sample of DataSet which contains fixed size elements.
+     *
+     * @param withReplacement Whether element can be selected more than once.
+     * @param numSample       The expected sample size.
+     * @return The sampled DataSet
+     */
--- End diff --
Maybe we should add a note here that fixed-size sampling currently takes
two passes over the data, and recommend sampling by fraction unless an exact
sample size is necessary.
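To make the trade-off behind this note concrete, here is a minimal, self-contained Java sketch. It is not the code from this pull request: the class SamplingSketch and its method names are hypothetical, and the fixed-size variant uses reservoir sampling as a stand-in technique on a local collection, which is not the two-pass approach referred to above. It only shows the difference in guarantees: sampling by fraction yields an approximate size, while fixed-size sampling yields an exact one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not part of Flink or of this pull request.
final class SamplingSketch {

    /** Without replacement, by fraction: keep each element independently with probability `fraction`. */
    static <T> List<T> sampleByFraction(Iterable<T> input, double fraction, long seed) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1] without replacement");
        }
        Random rng = new Random(seed);
        List<T> out = new ArrayList<>();
        for (T element : input) {
            // The result size is only fraction * n on average, not exactly.
            if (rng.nextDouble() < fraction) {
                out.add(element);
            }
        }
        return out;
    }

    /** Without replacement, fixed size: reservoir sampling (a stand-in technique, see the note above). */
    static <T> List<T> sampleFixedSize(Iterable<T> input, int numSamples, long seed) {
        if (numSamples < 0) {
            throw new IllegalArgumentException("numSamples must be non-negative");
        }
        Random rng = new Random(seed);
        List<T> reservoir = new ArrayList<>();
        long seen = 0;
        for (T element : input) {
            seen++;
            if (reservoir.size() < numSamples) {
                // Fill the reservoir with the first numSamples elements.
                reservoir.add(element);
            } else {
                // Replace a random slot with probability numSamples / seen.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < numSamples) {
                    reservoir.set((int) slot, element);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(i);
        }
        System.out.println("by fraction: " + sampleByFraction(data, 0.1, 42L).size() + " elements");
        System.out.println("fixed size:  " + sampleFixedSize(data, 10, 42L).size() + " elements");
    }
}
```

In this sketch the fraction-based result varies in size around fraction * n, while the fixed-size result always holds min(numSamples, n) elements.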
> Create sample operator for Dataset
> ----------------------------------
>
> Key: FLINK-1901
> URL: https://issues.apache.org/jira/browse/FLINK-1901
> Project: Flink
> Issue Type: Improvement
> Components: Core
> Reporter: Theodore Vasiloudis
> Assignee: Chengxiang Li
>
> In order to be able to implement Stochastic Gradient Descent and a number of
> other machine learning algorithms we need to have a way to take a random
> sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset,
> choose the relative or exact size of the sample, set a seed for
> reproducibility, and support sampling within iterations.
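As an illustration of two of the requirements quoted above (sampling with replacement by relative size, and a seed for reproducibility), here is a small, self-contained Java sketch. It rests on an assumption: drawing a Poisson-distributed number of copies per element is one common way to sample with replacement by fraction, and nothing in this message confirms that the pull request works this way. The class ReplacementSamplingSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not part of Flink or of this pull request.
final class ReplacementSamplingSketch {

    /** Draws a Poisson-distributed count using Knuth's multiplication method (fine for small means). */
    static int nextPoisson(double mean, Random rng) {
        double limit = Math.exp(-mean);
        double product = 1.0;
        int count = 0;
        do {
            count++;
            product *= rng.nextDouble();
        } while (product > limit);
        return count - 1;
    }

    /**
     * With replacement, by fraction: each element is emitted a Poisson(fraction) number of times,
     * so a fraction above 1 means every element appears more than once on average.
     */
    static <T> List<T> sampleWithReplacement(Iterable<T> input, double fraction, long seed) {
        if (fraction < 0.0) {
            throw new IllegalArgumentException("fraction must be non-negative");
        }
        // A fixed seed makes the sample reproducible for the same input order.
        Random rng = new Random(seed);
        List<T> out = new ArrayList<>();
        for (T element : input) {
            int copies = nextPoisson(fraction, rng);
            for (int i = 0; i < copies; i++) {
                out.add(element);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            data.add(i);
        }
        System.out.println("sample size: " + sampleWithReplacement(data, 2.5, 42L).size());
    }
}
```

Calling the method twice with the same seed and the same input order returns the same sample, which is the reproducibility property the issue asks for.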