Till Rohrmann created FLINK-1742:
------------------------------------
Summary: Sample data points for MultipleLinearRegression to support proper SGD
Key: FLINK-1742
URL: https://issues.apache.org/jira/browse/FLINK-1742
Project: Flink
Issue Type: Improvement
Components: Machine Learning Library
Reporter: Till Rohrmann
Priority: Minor
Currently, the {{MultipleLinearRegression}} implementation applies its
stochastic gradient descent (SGD) step to all data points. To scale to very
large data sets, each {{MultipleLinearRegression}} iteration should perform the
SGD step on only a random subset of the data points. Proper data point sampling
should therefore be added to the {{MultipleLinearRegression}} implementation.
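
To illustrate the intended behaviour, here is a minimal in-memory sketch in plain Python. It is not Flink code and not the existing {{MultipleLinearRegression}} implementation. The function name, the parameters ({{sample_fraction}}, {{step_size}}, {{seed}}) and the squared-error loss are assumptions made for this example only. Each iteration draws a fresh random subset and computes the gradient on that subset alone.

```python
import random


def sgd_linear_regression(points, iterations=1000, sample_fraction=0.1,
                          step_size=0.5, seed=0):
    """Fit y ~ w.x + b with mini-batch SGD on a squared-error loss.

    points: list of (features, label) pairs, features being a list of floats.
    All names and defaults are hypothetical and chosen for illustration.
    """
    rng = random.Random(seed)
    dim = len(points[0][0])
    weights = [0.0] * dim
    intercept = 0.0
    batch_size = max(1, int(len(points) * sample_fraction))

    for _ in range(iterations):
        # Draw a new random subset (without replacement) in every iteration.
        batch = rng.sample(points, batch_size)

        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in batch:
            error = sum(w * xi for w, xi in zip(weights, x)) + intercept - y
            for j in range(dim):
                grad_w[j] += error * x[j]
            grad_b += error

        # Average the gradient over the sampled points only.
        weights = [w - step_size * g / batch_size
                   for w, g in zip(weights, grad_w)]
        intercept -= step_size * grad_b / batch_size

    return weights, intercept
```

The per-iteration cost of the gradient computation then depends on the sample size rather than on the size of the full data set, provided the sample itself can be drawn cheaply.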
An easy implementation would be a filter that flips a coin for each data point
to decide whether to keep or discard it. The downside of this approach is that
the whole data set has to be processed. It would be beneficial if a sampling
operator did not have to process the whole data set, given that it knows the
data set's size. This assumption should hold for cached data sets in an
iteration.
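
The two strategies can be contrasted with a small plain-Python sketch. This is an illustration only: both function names are hypothetical, and the second variant assumes that the data set's size is known and that points can be accessed by index, which a distributed data set does not necessarily offer.

```python
import random


def bernoulli_sample(points, fraction, rng):
    """Coin-flip filter: visits every point and keeps each one with
    probability `fraction`. The sample size varies from call to call."""
    return [p for p in points if rng.random() < fraction]


def index_sample(points, sample_size, rng):
    """Known-size sampling: draws `sample_size` distinct indices up front
    and reads only those points. Assumes len(points) and indexed access
    are available."""
    indices = rng.sample(range(len(points)), sample_size)
    return [points[i] for i in indices]


if __name__ == "__main__":
    rng = random.Random(7)
    data = list(range(1000))
    # Evaluates the coin flip for all 1000 points.
    print(len(bernoulli_sample(data, 0.1, rng)))
    # Touches exactly 100 points.
    print(len(index_sample(data, 100, rng)))
```

The filter's work grows with the size of the data set, whereas the index-based variant's work grows with the sample size.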