[
https://issues.apache.org/jira/browse/FLINK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553977#comment-14553977
]
ASF GitHub Bot commented on FLINK-2034:
---------------------------------------
Github user thvasilo commented on a diff in the pull request:
https://github.com/apache/flink/pull/688#discussion_r30786639
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
under the License.
-->
+The Machine Learning (ML) library for Flink is a new effort to bring
scalable ML tools to the Flink
+community. Our goal is to design and implement a system that is
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue
code that they are
+forced to write [1] in the process of implementing an end-to-end ML
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end
we will be providing
+detailed documentation along with examples for every part of the system.
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using
familiar programming
+concepts and terminology.
+
+Unlike other data-processing systems, Flink exploits in-memory data
streaming, and natively
+executes iterative processing algorithms, which are common in ML. We plan
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data
streams.
+
+FlinkML will allow data scientists to test their models locally, using
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML
pipelines, and Spark’s
+[MLlib](https://spark.apache.org/mllib/) for providing ML algorithms that
scale with problem and
+cluster sizes.
+
+We already have some of the building blocks for FlinkML in place, and will
continue to extend the
+library with more algorithms. An example of how simple it is to create a
learning model in
+FlinkML is given below:
+
+{% highlight scala %}
+// LabeledVector is a feature vector with a label (class or real value)
+val data: DataSet[LabeledVector] = ...
+
+val learner = MultipleLinearRegression()
+
+val parameters = ParameterMap()
+ .add(MultipleLinearRegression.Stepsize, 1.0)
+ .add(MultipleLinearRegression.Iterations, 10)
+ .add(MultipleLinearRegression.ConvergenceThreshold, 0.001)
+
+val model = learner.fit(data, parameters)
+{% endhighlight %}
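The three parameters in the quoted snippet (step size, iteration count, convergence threshold) can be read as controls on a gradient-descent loop. The following is a minimal, self-contained sketch in plain Scala of how such parameters typically interact. It is not FlinkML code and makes no claim about how FlinkML implements `MultipleLinearRegression`: `GradientDescentSketch`, `LabeledPoint`, `fitLinear`, and the stopping rule (absolute change in mean squared loss) are all assumptions made for illustration.

```scala
// Illustrative only: every name in this object is hypothetical, not a FlinkML API.
object GradientDescentSketch {

  // A feature vector with a real-valued label.
  final case class LabeledPoint(label: Double, features: Vector[Double])

  // Linear model: dot(weights, x) + intercept.
  def predict(weights: Vector[Double], intercept: Double, x: Vector[Double]): Double =
    weights.zip(x).map { case (w, xi) => w * xi }.sum + intercept

  // Batch gradient descent on the loss (1 / 2n) * sum of squared residuals.
  def fitLinear(
      data: Seq[LabeledPoint],
      stepsize: Double,
      iterations: Int,
      convergenceThreshold: Double): (Vector[Double], Double) = {
    require(data.nonEmpty, "training data must not be empty")
    val n = data.size.toDouble
    val dim = data.head.features.size

    var weights = Vector.fill(dim)(0.0)
    var intercept = 0.0
    var previousLoss = Double.MaxValue
    var converged = false
    var i = 0

    while (i < iterations && !converged) {
      // Residuals of the current model on every training point.
      val residuals = data.map(p => predict(weights, intercept, p.features) - p.label)
      val loss = residuals.map(r => r * r).sum / (2 * n)

      // Assumed stopping rule: the loss changed by less than the threshold.
      converged = math.abs(previousLoss - loss) < convergenceThreshold

      if (!converged) {
        // Gradient of the loss with respect to each weight and the intercept.
        val gradW = Vector.tabulate(dim) { j =>
          data.zip(residuals).map { case (p, r) => r * p.features(j) }.sum / n
        }
        val gradB = residuals.sum / n

        // Move against the gradient, scaled by the step size.
        weights = weights.zip(gradW).map { case (w, g) => w - stepsize * g }
        intercept -= stepsize * gradB
        previousLoss = loss
      }
      i += 1
    }
    (weights, intercept)
  }
}
```

As a usage sketch, fitting points generated from `y = 2x + 1` for `x` in `0 to 4` with `fitLinear(data, 0.1, 5000, 1e-12)` should return a weight close to 2 and an intercept close to 1. A step size that is too large for the data makes the loop diverge, which is why the step size and the iteration cap are usually tuned together.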
+
+The roadmap below gives an indication of the algorithms we aim to
implement in the coming
+months. Items in **bold** have already been implemented:
+
+
+* Pipelines of transformers and learners
+* Data pre-processing
+  * **Feature scaling**
+  * **Polynomial feature base mapper**
+  * Feature hashing
+  * Feature extraction for text
+  * Dimensionality reduction
+* Model selection and performance evaluation
+  * Cross-validation for model selection and evaluation
+* Supervised learning
+  * Optimization framework
+    * **Stochastic Gradient Descent**
+    * L-BFGS
+  * Generalized Linear Models
+    * **Multiple linear regression**
+    * LASSO, Ridge regression
+    * Multi-class Logistic regression
+  * Random forests
+  * **Support Vector Machines**
+* Unsupervised learning
+  * Clustering
+    * K-means clustering
+  * PCA
+* Recommendation
+  * **ALS**
+* Text analytics
+  * LDA
+* Statistical estimation tools
+* Distributed linear algebra
+* Streaming ML
--- End diff --
That's a good idea, I'll add a link to the roadmap instead.
> Add vision and roadmap for ML library to docs
> ---------------------------------------------
>
> Key: FLINK-2034
> URL: https://issues.apache.org/jira/browse/FLINK-2034
> Project: Flink
> Issue Type: Improvement
> Components: Machine Learning Library
> Reporter: Theodore Vasiloudis
> Assignee: Theodore Vasiloudis
> Labels: ML
> Fix For: 0.9
>
>
> We should have a document describing the vision of the Machine Learning
> library in Flink and an up-to-date roadmap.