[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649429#comment-14649429
 ] 

ASF GitHub Bot commented on FLINK-1901:
---------------------------------------

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/949#issuecomment-126740651
  
    Thanks for your contribution @ChengXiangLi. The code is really well tested 
and well structured. Great work :-)
    
    I only had a few minor comments. There is, however, one thing I'm not so 
sure about. With the current implementation, all parallel tasks of the sampling 
operator will get the same random generator/seed value. Thus, every parallel 
task will generate the same sequence of random numbers. I think this can have a 
negative influence on the sampling. What we could do is use 
`RichMapPartitionFunction` instead of `MapPartitionFunction`. With the rich 
function, we either have access to the subtask index, given by 
`getRuntimeContext().getIndexOfThisSubtask()`, which we could use to modify 
the initial seed, or we can create the random number generator in the `open` 
method (this method is executed on the TaskManager). Assuming that the clocks 
are not completely synchronized and that the individual tasks will not be 
instantiated at exactly the same time, this could give us less correlated 
random number sequences. What do you think? 
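
    To make the first option concrete, here is a minimal sketch in plain Java, 
so it runs without Flink on the classpath. The names `SubtaskSeeds`, `seedFor` 
and `randomFor` are hypothetical, and the SplitMix64-style mixing step is an 
assumption that is not part of the pull request; simply adding the subtask 
index to the seed would follow the suggestion more literally. In a 
`RichMapPartitionFunction`, the index would presumably be read from 
`getRuntimeContext().getIndexOfThisSubtask()` inside `open`.

```java
import java.util.Random;

public class SubtaskSeeds {

    // Hypothetical helper: derives a distinct seed for each parallel subtask
    // from one user-supplied base seed. The bit mixing (a SplitMix64-style
    // finalizer) is an assumption; it spreads neighbouring indices apart so
    // the resulting seeds do not differ in only a few low bits.
    public static long seedFor(long baseSeed, int subtaskIndex) {
        long z = baseSeed + (subtaskIndex + 1) * 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Creates the generator a subtask would build in its open() method.
    // The same (baseSeed, subtaskIndex) pair always yields the same sequence,
    // so a fixed base seed keeps the sample reproducible.
    public static Random randomFor(long baseSeed, int subtaskIndex) {
        return new Random(seedFor(baseSeed, subtaskIndex));
    }

    public static void main(String[] args) {
        long baseSeed = 42L;
        // Stand-in for four parallel subtasks with indices 0..3.
        for (int subtask = 0; subtask < 4; subtask++) {
            Random rnd = randomFor(baseSeed, subtask);
            System.out.printf("subtask %d: %.4f %.4f %.4f%n",
                    subtask, rnd.nextDouble(), rnd.nextDouble(), rnd.nextDouble());
        }
    }
}
```

    Deriving the seed from the index rather than from the clock keeps a run 
reproducible for a fixed base seed and a fixed parallelism, which the issue 
asks for; the `open`-time variant trades that reproducibility for simplicity.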


> Create sample operator for Dataset
> ----------------------------------
>
>                 Key: FLINK-1901
>                 URL: https://issues.apache.org/jira/browse/FLINK-1901
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Theodore Vasiloudis
>            Assignee: Chengxiang Li
>
> In order to implement Stochastic Gradient Descent and a number of other 
> machine learning algorithms, we need a way to take a random sample from a 
> Dataset.
> We need to be able to sample with or without replacement from the Dataset, 
> choose the relative size of the sample, and set a seed for reproducibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
