[ https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661768#comment-14661768 ]

ASF GitHub Bot commented on FLINK-1901:
---------------------------------------

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/949#issuecomment-128690389
  
    The current state with the `RichMapPartitionFunctions` looks good to me 
:+1: 
    
    You're right that users usually want to fix the size of the whole sample. 
An easy solution could be to assign each item an index, see 
`DataSetUtils.zipWithIndex`. Then we can compute the maximum index (which 
effectively counts the data set's elements). This gives us the range from which 
we have to sample. By generating a parallel sequence of the size of our sample 
with `env.generateSequence(maxIndex)`, we could then sample from `[0, 
maxIndex]`. Finally, we would have to join this data set with the original data 
set, which has the indices assigned. There are probably more efficient 
algorithms out there than this one.
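    To make the idea concrete, here is a rough local sketch of the proposed 
approach in plain Java. It simulates the `DataSet` as an in-memory list; the 
class and method names are hypothetical, and the real implementation would of 
course use Flink's distributed `DataSet` API (`zipWithIndex`, 
`generateSequence`, and a join) rather than a `HashMap`:

```java
import java.util.*;
import java.util.stream.*;

public class FixedSizeSampleSketch {
    // Sketch of the proposed fixed-size sampling: index the elements,
    // pick k distinct indices from [0, maxIndex], then "join" the chosen
    // indices back to the indexed elements.
    static <T> List<T> sample(List<T> data, int k, long seed) {
        // Step 1: "zipWithIndex" -- assign each element a unique index.
        Map<Long, T> indexed = new HashMap<>();
        long idx = 0;
        for (T element : data) {
            indexed.put(idx++, element);
        }
        // Step 2: the maximum index, which effectively counts the elements.
        long maxIndex = idx - 1;

        // Step 3: generate k distinct random indices in [0, maxIndex]
        // (the distributed version would use env.generateSequence).
        Random rnd = new Random(seed);
        Set<Long> chosen = new LinkedHashSet<>();
        while (chosen.size() < k) {
            chosen.add((long) rnd.nextInt((int) (maxIndex + 1)));
        }

        // Step 4: join the sampled indices with the indexed data set.
        return chosen.stream().map(indexed::get).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> data =
            IntStream.range(0, 100).boxed().collect(Collectors.toList());
        System.out.println(sample(data, 5, 42L));
    }
}
```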
    
    Just ping me when you've found a solution to the problem. Looking forward 
to reviewing it :-)


> Create sample operator for Dataset
> ----------------------------------
>
>                 Key: FLINK-1901
>                 URL: https://issues.apache.org/jira/browse/FLINK-1901
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Theodore Vasiloudis
>            Assignee: Chengxiang Li
>
> In order to be able to implement Stochastic Gradient Descent and a number of 
> other machine learning algorithms we need to have a way to take a random 
> sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset, 
> choose the relative size of the sample, and set a seed for reproducibility.
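The quoted requirements (sampling with or without replacement, a relative 
sample size, and a seed for reproducibility) can be illustrated with a small 
local sketch. The class and method names below are hypothetical plain-Java 
stand-ins, not the Flink operator itself:

```java
import java.util.*;

public class FractionSampleSketch {
    // Sampling without replacement at a relative size: keep each element
    // independently with probability `fraction` (Bernoulli sampling).
    // A fixed seed makes the sample reproducible.
    static <T> List<T> sampleWithoutReplacement(List<T> data, double fraction, long seed) {
        Random rnd = new Random(seed);
        List<T> out = new ArrayList<>();
        for (T element : data) {
            if (rnd.nextDouble() < fraction) {
                out.add(element);
            }
        }
        return out;
    }

    // Sampling with replacement: draw round(fraction * n) elements
    // uniformly at random, so the same element may appear more than once.
    static <T> List<T> sampleWithReplacement(List<T> data, double fraction, long seed) {
        Random rnd = new Random(seed);
        int k = (int) Math.round(fraction * data.size());
        List<T> out = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            out.add(data.get(rnd.nextInt(data.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add(i);
        System.out.println(sampleWithoutReplacement(data, 0.1, 7L));
        System.out.println(sampleWithReplacement(data, 0.1, 7L));
    }
}
```

Note that Bernoulli sampling only yields a sample whose size is the requested 
fraction in expectation, which is exactly why the comment above discusses how 
to fix the sample size exactly.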



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
