[ https://issues.apache.org/jira/browse/FLINK-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann closed FLINK-1742.
--------------------------------
Resolution: Duplicate
Contained in FLINK-1901
> Sample data points for MultipleLinearRegression to support proper SGD
> ---------------------------------------------------------------------
>
> Key: FLINK-1742
> URL: https://issues.apache.org/jira/browse/FLINK-1742
> Project: Flink
> Issue Type: Improvement
> Components: Machine Learning Library
> Reporter: Till Rohrmann
> Priority: Minor
> Labels: ML
>
> Currently, the {{MultipleLinearRegression}} implementation applies the
> stochastic gradient descent method to all data points. In order to scale
> to huge data sets, each {{MultipleLinearRegression}} iteration should perform
> the SGD only on a random subset of data points. Therefore, proper data point
> sampling should be added to the {{MultipleLinearRegression}} implementation.
> A simple implementation would be a filter that flips a coin for each data
> point to decide whether to keep or discard it. The downside of this
> approach is that the whole data set still has to be processed. It would be
> beneficial if a sampling operator did not have to process the whole data set,
> provided that it knows the data set's size. This assumption should hold for
> cached data sets in an iteration.
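
To illustrate the two sampling strategies described above, here is a minimal
sketch in plain Python. It does not use the Flink API. The function names
(bernoulli_sample, index_sample, sgd_step), the (features, label) point layout,
and the squared-loss update without an intercept are all assumptions made for
this example, not the actual MultipleLinearRegression code.

```python
import random


def bernoulli_sample(points, fraction, rng):
    """Coin-flip filter: visits every point and keeps each one
    independently with probability `fraction`."""
    return [p for p in points if rng.random() < fraction]


def index_sample(points, fraction, rng):
    """Size-aware sampling: draws k distinct indices up front and reads
    only those k points. Requires the data set size to be known."""
    k = int(round(fraction * len(points)))
    return [points[i] for i in rng.sample(range(len(points)), k)]


def sgd_step(weights, batch, learning_rate):
    """One gradient step for squared loss, averaged over the sampled batch.
    Each point is (features, label); no intercept term (assumption)."""
    if not batch:
        return list(weights)
    grad = [0.0] * len(weights)
    for x, y in batch:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xi in enumerate(x):
            grad[j] += err * xi
    return [w - learning_rate * g / len(batch) for w, g in zip(weights, grad)]
```

The coin-flip filter yields a sample whose size varies from call to call and
always scans the full input, whereas the index-based variant returns exactly
round(fraction * n) points and reads only those, which is the saving the
description hopes for when the data set size is known.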
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)