[
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649429#comment-14649429
]
ASF GitHub Bot commented on FLINK-1901:
---------------------------------------
Github user tillrohrmann commented on the pull request:
https://github.com/apache/flink/pull/949#issuecomment-126740651
Thanks for your contribution @ChengXiangLi. The code is really well tested
and well structured. Great work :-)
I had only some minor comments. There is, however, one thing I'm not so sure
about. With the current implementation, all parallel tasks of the sampling
operator get the same random generator/seed value, so every parallel task
generates the same sequence of random numbers. I think this can have a
negative influence on the sampling, since the selection decisions made on
different partitions would then be correlated. What we could do is use
`RichMapPartitionFunction` instead of `MapPartitionFunction`. With the rich
function, we have two options: we can use the subtask index, given by
`getRuntimeContext().getIndexOfThisSubtask()`, to modify the initial seed, or
we can create the random number generator in the `open` method (this method is
executed on the TaskManager). For the second option, assuming that the clocks
are not completely synchronized and that the individual tasks are not
instantiated at exactly the same time, this could give us less correlated
random number sequences. What do you think?
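To make the first option concrete, here is a minimal sketch in plain Java, so it runs without Flink on the classpath. The names `SubtaskSeeds`, `seedFor`, and `randomFor` are hypothetical and not part of the pull request, and the SplitMix64-style mixing step is only one possible way to combine the user seed with the subtask index.

```java
import java.util.Random;

// Hypothetical helper, not part of Flink or of the pull request.
final class SubtaskSeeds {
    private SubtaskSeeds() {}

    // Derives a per-subtask seed from the user-supplied base seed.
    // The result is deterministic for a given (baseSeed, subtaskIndex),
    // so a fixed seed still yields reproducible samples, but different
    // subtasks start their generators from different seeds.
    static long seedFor(long baseSeed, int subtaskIndex) {
        // SplitMix64-style mixing: offset by an odd constant per subtask,
        // then scramble the bits with an invertible finalizer.
        long z = baseSeed + (subtaskIndex + 1L) * 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Convenience wrapper returning a generator for one subtask.
    static Random randomFor(long baseSeed, int subtaskIndex) {
        return new Random(seedFor(baseSeed, subtaskIndex));
    }
}
```

Inside a `RichMapPartitionFunction`, the generator could then presumably be created in `open` with something like `SubtaskSeeds.randomFor(seed, getRuntimeContext().getIndexOfThisSubtask())`. Unlike a clock-based seed, this keeps the result reproducible for a fixed seed and parallelism.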
> Create sample operator for Dataset
> ----------------------------------
>
> Key: FLINK-1901
> URL: https://issues.apache.org/jira/browse/FLINK-1901
> Project: Flink
> Issue Type: Improvement
> Components: Core
> Reporter: Theodore Vasiloudis
> Assignee: Chengxiang Li
>
> In order to be able to implement Stochastic Gradient Descent and a number of
> other machine learning algorithms we need to have a way to take a random
> sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset,
> choose the relative size of the sample, and set a seed for reproducibility.