[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635082#comment-14635082 ]
Sachin Goel commented on FLINK-1901:
------------------------------------
Okay. I checked the whole code, and the random sampling I'm using there is
never used inside an iteration, so there are no problems on that front.
However, it would certainly be good to have a separate random sampling
module that can work on any data set.
[[email protected]], do you think there is any use for a sampling
procedure other than uniform random sampling? For example, suppose there is
a function which maps every element in the dataset to its probability of
selection.
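
To make the question concrete, here is a minimal sketch of that kind of
sampling in plain Python. It is not Flink code and not taken from any PR; the
names sample_by_probability and selection_probability are hypothetical, and
the sketch assumes each element is kept or dropped independently of the
others.

```python
import random
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def sample_by_probability(
    elements: Iterable[T],
    selection_probability: Callable[[T], float],
    seed: Optional[int] = None,
) -> Iterator[T]:
    """Keep each element independently with its own probability.

    `selection_probability` is a user-supplied (hypothetical) function that
    maps an element to a value in [0, 1].
    """
    rng = random.Random(seed)  # private generator, so a seed gives repeatable output
    for element in elements:
        p = selection_probability(element)
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range for {element!r}: {p}")
        # rng.random() is uniform on [0, 1), so this is true with probability p.
        if rng.random() < p:
            yield element


if __name__ == "__main__":
    values = range(1, 21)
    # Illustrative weighting: larger values are more likely to be selected.
    picked = list(sample_by_probability(values, lambda v: v / 20.0, seed=1))
    print(picked)
```

Because each decision depends only on the element itself, a single pass is
enough and no state has to be shared between elements. Uniform random
sampling is the special case where the function returns the same value for
every element.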
[~chengxiang li], yes. There is an ongoing PR
(https://github.com/apache/flink/pull/757). And yes, it would certainly make
sense to have a generic sample function.
> Create sample operator for Dataset
> ----------------------------------
>
> Key: FLINK-1901
> URL: https://issues.apache.org/jira/browse/FLINK-1901
> Project: Flink
> Issue Type: Improvement
> Components: Core
> Reporter: Theodore Vasiloudis
> Assignee: Chengxiang Li
>
> In order to be able to implement Stochastic Gradient Descent and a number of
> other machine learning algorithms we need to have a way to take a random
> sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset,
> choose the relative size of the sample, and set a seed for reproducibility.
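
As a rough illustration of the three requirements in the description (with or
without replacement, relative sample size, seed), here is a minimal sketch in
plain Python. It is not the Flink operator and not its API; the function names
and parameters are hypothetical. It assumes one common approach: a Bernoulli
trial per element when sampling without replacement, and a Poisson-distributed
number of copies per element when sampling with replacement. With either
method the sample size is only equal to fraction * N in expectation.

```python
import math
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def _poisson(rng: random.Random, mean: float) -> int:
    """Draw a Poisson-distributed count (Knuth's method, fine for small means)."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample(
    elements: Iterable[T],
    fraction: float,
    with_replacement: bool = False,
    seed: Optional[int] = None,
) -> Iterator[T]:
    """Yield a random sample whose expected size is fraction * len(elements)."""
    if fraction < 0.0:
        raise ValueError("fraction must be non-negative")
    if not with_replacement and fraction > 1.0:
        raise ValueError("fraction must be <= 1 when sampling without replacement")
    rng = random.Random(seed)  # the same seed reproduces the same sample
    for element in elements:
        if with_replacement:
            # An element may be emitted zero, one, or several times.
            copies = _poisson(rng, fraction)
        else:
            # An element is emitted at most once.
            copies = 1 if rng.random() < fraction else 0
        for _ in range(copies):
            yield element


if __name__ == "__main__":
    data = list(range(20))
    print(list(sample(data, 0.3, seed=42)))
    print(list(sample(data, 0.3, with_replacement=True, seed=42)))
```

Both variants need only one pass over the data and no knowledge of the total
number of elements, which is why they are attractive for a distributed data
set. A sample of an exact size would need a different technique, such as
reservoir sampling.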