[
https://issues.apache.org/jira/browse/FLINK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553923#comment-14553923
]
ASF GitHub Bot commented on FLINK-2034:
---------------------------------------
Github user tillrohrmann commented on a diff in the pull request:
https://github.com/apache/flink/pull/688#discussion_r30784183
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
under the License.
-->
+The Machine Learning (ML) library for Flink is a new effort to bring
scalable ML tools to the Flink
+community. Our goal is to design and implement a system that is
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end
we will be providing
+detailed documentation along with examples for every part of the system.
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using
familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data
streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data
streams.
+
+FlinkML will allow data scientists to test their models locally using
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML
pipelines, and Spark’s
+[MLlib](https://spark.apache.org/mllib/) for providing ML algorithms that
scale with problem and
+cluster sizes.
+
+We already have some of the building blocks for FlinkML in place, and will
continue to extend the
+library with more algorithms. An example of how simple it is to create a
learning model in
+FlinkML is given below:
+
+{% highlight scala %}
+// LabeledVector is a feature vector with a label (class or real value)
+val data: DataSet[LabeledVector] = ...
+
+val learner = MultipleLinearRegression()
+
+val parameters = ParameterMap()
+ .add(MultipleLinearRegression.Stepsize, 1.0)
+ .add(MultipleLinearRegression.Iterations, 10)
+ .add(MultipleLinearRegression.ConvergenceThreshold, 0.001)
+
+val model = learner.fit(data, parameters)
--- End diff --
With the new pipelining, this has to be updated.
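
For reference, a minimal sketch of how the quoted example might look under the pipelining API. It assumes setter-style configuration (`setStepsize`, `setIterations`, `setConvergenceThreshold`) and a `fit` call that trains the learner in place, followed by `predict`. These names, the toy data, and the object name `PipelinedRegressionSketch` are assumptions for illustration and should be checked against the merged FlinkML code.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.regression.MultipleLinearRegression

// Hypothetical driver object; requires flink-scala and flink-ml on the classpath.
object PipelinedRegressionSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Toy training data (assumed): label first, then the feature vector.
    val training: DataSet[LabeledVector] = env.fromCollection(Seq(
      LabeledVector(2.0, DenseVector(1.0)),
      LabeledVector(4.0, DenseVector(2.0)),
      LabeledVector(6.0, DenseVector(3.0))
    ))

    // Parameters are set on the learner itself instead of a separate ParameterMap.
    // Values are the ones from the quoted diff.
    val mlr = MultipleLinearRegression()
      .setStepsize(1.0)
      .setIterations(10)
      .setConvergenceThreshold(0.001)

    // fit trains the learner in place; it does not return a separate model object.
    mlr.fit(training)

    // The trained learner is then used directly for prediction.
    val unseen = env.fromCollection(Seq(DenseVector(4.0)))
    mlr.predict(unseen).print()
  }
}
```

The main difference from the diff is that there is no `val model = learner.fit(data, parameters)` step: the learner holds its own configuration and its fitted state.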
> Add vision and roadmap for ML library to docs
> ---------------------------------------------
>
> Key: FLINK-2034
> URL: https://issues.apache.org/jira/browse/FLINK-2034
> Project: Flink
> Issue Type: Improvement
> Components: Machine Learning Library
> Reporter: Theodore Vasiloudis
> Assignee: Theodore Vasiloudis
> Labels: ML
> Fix For: 0.9
>
>
> We should have a document describing the vision of the Machine Learning
> library in Flink and an up to date roadmap.