[ 
https://issues.apache.org/jira/browse/FLINK-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Rohrmann closed FLINK-1742.
--------------------------------
    Resolution: Duplicate

Contained in FLINK-1901.

> Sample data points for MultipleLinearRegression to support proper SGD
> ---------------------------------------------------------------------
>
>                 Key: FLINK-1742
>                 URL: https://issues.apache.org/jira/browse/FLINK-1742
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Priority: Minor
>              Labels: ML
>
> Currently, stochastic gradient descent (SGD) in the
> {{MultipleLinearRegression}} implementation is applied to all data points. To
> scale to huge data sets, each {{MultipleLinearRegression}} iteration should
> perform SGD only on a random subset of the data points. Proper data point
> sampling should therefore be added to the implementation. A simple
> implementation would be a filter that flips a coin for each data point to
> decide whether to keep or discard it. The downside of this approach is that
> the whole data set still has to be processed. It would be beneficial if a
> sampling operator did not have to process the whole data set, given that it
> knows the data set's size. This assumption should hold for cached data sets
> in an iteration.
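
The two sampling strategies described in the issue can be contrasted with a minimal sketch in plain Python. This is not Flink code and does not reflect how FLINK-1901 was implemented. The function names `bernoulli_sample` and `index_sample` are hypothetical, and the data set is assumed to be an in-memory list with random access, which stands in for a cached data set of known size.

```python
import random


def bernoulli_sample(points, fraction, rng):
    """Coin-flip filter: visits every point and keeps each one
    independently with probability `fraction`."""
    return [p for p in points if rng.random() < fraction]


def index_sample(points, sample_size, rng):
    """Known-size sampling: draws `sample_size` distinct indices up front
    and reads only those points, so the rest of the data is never touched.
    Assumes len(points) is known and `points` supports random access."""
    indices = rng.sample(range(len(points)), sample_size)
    return [points[i] for i in indices]


if __name__ == "__main__":
    rng = random.Random(42)
    data = list(range(1000))
    # Sample size varies from run to run; every element is inspected.
    print(len(bernoulli_sample(data, 0.1, rng)))
    # Sample size is exact; only 100 elements are read.
    print(len(index_sample(data, 100, rng)))
```

The coin-flip variant yields a sample whose size is only approximately `fraction * len(points)`, while the index-based variant returns exactly `sample_size` points but relies on the size being known in advance, which is the assumption the issue makes for cached data sets.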



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
