[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681142#comment-14681142
 ] 

ASF GitHub Bot commented on FLINK-1901:
---------------------------------------

Github user ChengXiangLi commented on the pull request:

    https://github.com/apache/flink/pull/949#issuecomment-129684685
  
    Thanks, @thvasilo. That paper introduces a random sampling algorithm that 
extends the one I described before. It uses two thresholds to filter elements 
before sorting: if an element's weight is larger than the "up threshold", it 
will be included in the final top K elements with very high probability; if 
its weight is smaller than the "down threshold", it will be excluded from the 
final top K elements with very high probability. If we accept that 
probability, we can settle the elements with weight larger than the "up 
threshold" or smaller than the "down threshold" without sorting them, and sort 
only the elements with weight between the thresholds.
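
As a rough illustration of the two-threshold idea, here is a minimal Python sketch. It is not the paper's exact method and not the code in this PR: the function names, the `z` parameter and the margin formula used to place the thresholds are all assumptions made for the example. It also falls back to a full sort when the thresholds turn out to be wrong, so the result always matches the plain "weight, sort, take top K" baseline.

```python
import math
import random


def sample_by_full_sort(items, k, seed=None):
    """Baseline: give every item a random weight, sort, keep the top k."""
    rng = random.Random(seed)
    weighted = [(rng.random(), item) for item in items]
    weighted.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in weighted[:k]]


def sample_with_thresholds(items, k, seed=None, z=5.0):
    """Same result as the baseline, but sorts only the uncertain middle band.

    The threshold placement below is an illustrative assumption,
    not the formula from the paper.
    """
    n = len(items)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and len(items)")
    if k == 0:
        return []

    rng = random.Random(seed)
    weighted = [(rng.random(), item) for item in items]

    p = k / n
    margin = z * math.sqrt(p * (1 - p) / n)  # hypothetical safety margin
    up = 1.0 - max(p - margin, 0.0)    # above this: almost surely in the top k
    down = 1.0 - min(p + margin, 1.0)  # below this: almost surely not in the top k

    accepted = [pair for pair in weighted if pair[0] > up]
    waitlist = [pair for pair in weighted if down <= pair[0] <= up]

    if len(accepted) > k or len(accepted) + len(waitlist) < k:
        # The thresholds misjudged this run; fall back to sorting everything.
        top = sorted(weighted, key=lambda pair: pair[0], reverse=True)[:k]
    else:
        # Only the elements between the thresholds need to be sorted.
        waitlist.sort(key=lambda pair: pair[0], reverse=True)
        top = accepted + waitlist[: k - len(accepted)]
    return [item for _, item in top]
```

With the same seed, both functions pick the same set of elements; the second one just sorts far fewer of them when the input is large.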
     
    This is a very good algorithm, and I will add it to my notes for future 
improvement, but I don't want to implement it right away. This PR is already 
large enough, so I would like to leave all algorithm optimizations for the 
future and keep only the basic implementations of the sampling algorithms 
here, making sure they are simple, easy to understand, and correct, so that 
they can serve as the performance baseline for further improvements.


> Create sample operator for Dataset
> ----------------------------------
>
>                 Key: FLINK-1901
>                 URL: https://issues.apache.org/jira/browse/FLINK-1901
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Theodore Vasiloudis
>            Assignee: Chengxiang Li
>
> In order to be able to implement Stochastic Gradient Descent and a number of 
> other machine learning algorithms we need to have a way to take a random 
> sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset, 
> choose the relative size of the sample, and set a seed for reproducibility.
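
To make the three requirements above concrete (with or without replacement, relative sample size, seed), here is a small Python sketch. It is only an assumption about what such an operator could look like, not the Flink API: `sample_fraction` and `_poisson` are hypothetical names. Sampling without replacement keeps each element independently with probability `fraction`; sampling with replacement is approximated by emitting each element a Poisson(`fraction`) number of times.

```python
import math
import random


def _poisson(rng, lam):
    """Draw a Poisson(lam) count using Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_fraction(items, fraction, with_replacement=False, seed=None):
    """Sample roughly fraction * len(items) elements; the seed makes it repeatable."""
    if fraction < 0 or (not with_replacement and fraction > 1):
        raise ValueError("fraction out of range")
    rng = random.Random(seed)
    if not with_replacement:
        # Each element is kept at most once, with probability `fraction`.
        return [x for x in items if rng.random() < fraction]
    sample = []
    for x in items:
        # Each element may appear zero, one, or several times.
        sample.extend([x] * _poisson(rng, fraction))
    return sample
```

Calling `sample_fraction(data, 0.1, seed=42)` twice returns the same sample, which is the reproducibility the issue asks for. The sample size is only approximately `fraction * len(items)` in both modes.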



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
