[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679955#comment-14679955
 ] 

ASF GitHub Bot commented on FLINK-1901:
---------------------------------------

Github user ChengXiangLi commented on the pull request:

    https://github.com/apache/flink/pull/949#issuecomment-129409569
  
    Thanks for the input, @tillrohrmann and @sachingoel0101. I would like to 
implement fixed-size sampling with only one pass through the source dataset. 
When a user samples a dataset, the dataset is usually quite large, so passing 
through it multiple times would add considerable cost. The basic idea of my 
solution for fixed-size sampling over a distributed stream is this: generate a 
random number for each input element as its weight, then select the top K 
elements with the largest weights. Because the weights are generated randomly, 
the top K elements form a random sample. You can find more details in the code 
and the Javadoc.
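
To illustrate the idea, here is a minimal sketch in plain Java. It is not the 
code from the pull request and does not use the Flink API; the class name 
TopKWeightSampler and the methods samplePartition and merge are hypothetical 
names chosen for this example. Each partition is scanned once and keeps its K 
heaviest elements in a min-heap, then the partial results are merged and the 
global top K are returned.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Flink.
class TopKWeightSampler {

    /** An element paired with the random weight assigned to it. */
    static final class Weighted<T> {
        final double weight;
        final T value;

        Weighted(double weight, T value) {
            this.weight = weight;
            this.value = value;
        }
    }

    /** One pass over a partition: keep the k elements with the largest random weights. */
    static <T> List<Weighted<T>> samplePartition(Iterable<T> input, int k, long seed) {
        Random rnd = new Random(seed);
        // Min-heap on weight, so the smallest retained weight is at the head.
        PriorityQueue<Weighted<T>> heap = new PriorityQueue<>(
                Math.max(1, k), Comparator.comparingDouble((Weighted<T> w) -> w.weight));
        for (T value : input) {
            double weight = rnd.nextDouble();
            if (heap.size() < k) {
                heap.add(new Weighted<>(weight, value));
            } else if (k > 0 && weight > heap.peek().weight) {
                heap.poll(); // evict the current lightest element
                heap.add(new Weighted<>(weight, value));
            }
        }
        return new ArrayList<>(heap);
    }

    /** Merge the per-partition candidates and keep the global top k by weight. */
    static <T> List<T> merge(List<List<Weighted<T>>> partials, int k) {
        List<Weighted<T>> all = new ArrayList<>();
        for (List<Weighted<T>> partial : partials) {
            all.addAll(partial);
        }
        all.sort(Comparator.comparingDouble((Weighted<T> w) -> w.weight).reversed());
        List<T> sample = new ArrayList<>();
        for (Weighted<T> w : all.subList(0, Math.min(k, all.size()))) {
            sample.add(w.value);
        }
        return sample;
    }
}
```

In this sketch each partition should get a different seed, otherwise the 
partitions would draw the same weight sequence and the weights would be 
correlated. Each partition holds at most K candidates, so the merge step only 
handles K times the number of partitions elements.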


> Create sample operator for Dataset
> ----------------------------------
>
>                 Key: FLINK-1901
>                 URL: https://issues.apache.org/jira/browse/FLINK-1901
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Theodore Vasiloudis
>            Assignee: Chengxiang Li
>
> In order to be able to implement Stochastic Gradient Descent and a number of 
> other machine learning algorithms we need to have a way to take a random 
> sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset, 
> choose the relative size of the sample, and set a seed for reproducibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
