yunfengzhou-hub commented on code in PR #97:
URL: https://github.com/apache/flink-ml/pull/97#discussion_r889790107
##########
flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java:
##########
@@ -94,6 +120,26 @@ public static <T> DataStream<T> reduce(DataStream<T> input,
ReduceFunction<T> fu
}
}
+ /**
+ * Takes a randomly sampled subset of elements in a bounded data stream.
+ *
+ * <p>If the number of elements in the stream is smaller than expected
number of samples, all
+ * elements will be included in the sample.
+ *
+ * @param input The input data stream.
+ * @param numSamples The number of elements to be sampled.
+ * @param randomSeed The seed to randomly pick elements as sample.
+ * @return A data stream containing a list of the sampled elements.
+ */
+ public static <T> DataStream<List<T>> sample(
+ DataStream<T> input, int numSamples, long randomSeed) {
+ return input.transform(
+ "samplingOperator",
+ Types.LIST(input.getType()),
+ new SamplingOperator<>(numSamples, randomSeed))
+ .setParallelism(1);
Review Comment:
According to offline discussion, I agree with it that we should not change
parallelism to make it more generic. I'll make the change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]