[
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681142#comment-14681142
]
ASF GitHub Bot commented on FLINK-1901:
---------------------------------------
Github user ChengXiangLi commented on the pull request:
https://github.com/apache/flink/pull/949#issuecomment-129684685
Thanks, @thvasilo. That paper introduces a random sampling algorithm that
extends the one I described before. It uses two thresholds to filter
elements before sorting: if an element's weight is larger than the "upper
threshold", it will be included in the final top K elements with very high
probability; if its weight is smaller than the "lower threshold", it will
be excluded from the final top K elements with very high probability. At an
acceptable failure probability, we can directly accept the elements with
weight above the upper threshold, drop the elements with weight below the
lower threshold, and sort only the elements whose weight lies between the
two thresholds.
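To make the idea concrete, here is a minimal Python sketch of the two-threshold
filtering, not the paper's exact algorithm and not the code in this PR. The
function names (`plain_sample`, `threshold_sample`) and the way the thresholds
are passed in by the caller are assumptions for illustration only; the paper
derives its thresholds from the sample size and a failure probability, which is
not reproduced here. When the filter fails (too many elements accepted, or too
few left to fill the sample), the sketch simply falls back to the full sort.

```python
import heapq
import random


def plain_sample(items, k, seed):
    """Baseline: give every element a random weight and keep the k largest."""
    rng = random.Random(seed)
    weighted = [(rng.random(), x) for x in items]
    return [x for _, x in heapq.nlargest(k, weighted, key=lambda t: t[0])]


def threshold_sample(items, k, lower, upper, seed):
    """Hypothetical two-threshold variant of plain_sample.

    `lower` and `upper` are caller-supplied assumptions in [0, 1].
    `items` must be re-iterable (e.g. a list) because of the fallback.
    """
    rng = random.Random(seed)
    accepted = []  # weight above the upper threshold: taken without sorting
    waitlist = []  # weight between the thresholds: sorted later
    for x in items:
        w = rng.random()
        if w > upper:
            accepted.append(x)
        elif w >= lower:
            waitlist.append((w, x))
        # weight below the lower threshold: dropped immediately

    missing = k - len(accepted)
    if missing < 0 or missing > len(waitlist):
        # The thresholds were a bad guess for this run; redo it the slow way.
        return plain_sample(items, k, seed)

    waitlist.sort(key=lambda t: t[0], reverse=True)
    return accepted + [x for _, x in waitlist[:missing]]
```

With the same seed, both functions select the same elements; the thresholded
version only sorts the (usually much smaller) waitlist, for example
`threshold_sample(list(range(10000)), 100, lower=0.98, upper=0.995, seed=42)`.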
This is a very good algorithm, and I will add it to my notes for future
improvement, but I don't want to implement it right away. This PR is already
large enough for me, so I would like to leave all algorithm optimizations
for the future and keep only the basic implementations of the sampling
algorithms here, making sure they are simple, easy to understand, and
correct, so that they can serve as the performance baseline for further
improvements.
> Create sample operator for Dataset
> ----------------------------------
>
> Key: FLINK-1901
> URL: https://issues.apache.org/jira/browse/FLINK-1901
> Project: Flink
> Issue Type: Improvement
> Components: Core
> Reporter: Theodore Vasiloudis
> Assignee: Chengxiang Li
>
> In order to be able to implement Stochastic Gradient Descent and a number of
> other machine learning algorithms we need to have a way to take a random
> sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset,
> choose the relative size of the sample, and set a seed for reproducibility.