Github user tillrohrmann commented on a diff in the pull request:
https://github.com/apache/flink/pull/688#discussion_r30784959
--- Diff: docs/libs/ml/index.md ---
@@ -20,8 +20,100 @@ specific language governing permissions and limitations
under the License.
-->
+The Machine Learning (ML) library for Flink is a new effort to bring
scalable ML tools to the Flink
+community. Our goal is to design and implement a system that is
scalable and can deal with
+problems of various sizes, whether your data size is measured in megabytes
or terabytes and beyond.
+We call this library FlinkML.
+
+An important concern for developers of ML systems is the amount of glue
code that developers are
+forced to write [1] in the process of implementing an end-to-end ML
system. Our goal with FlinkML
+is to help developers keep glue code to a minimum. The Flink ecosystem
provides a great setting to
+tackle this problem, with its scalable ETL capabilities that can be easily
combined inside the same
+program with FlinkML, allowing the development of robust pipelines without
the need to use yet
+another technology for data ingestion and data munging.
+
+Another goal for FlinkML is to make the library easy to use. To that end
we will be providing
+detailed documentation along with examples for every part of the system.
Our aim is that developers
+will be able to get started with writing their ML pipelines quickly, using
familiar programming
+concepts and terminology.
+
+Contrary to other data-processing systems, Flink exploits in-memory data
streaming, and natively
+executes iterative processing algorithms which are common in ML. We plan
to exploit the streaming
+nature of Flink, and provide functionality designed specifically for data
streams.
+
+FlinkML will allow data scientists to test their models locally and using
subsets of data, and then
+use the same code to run their algorithms at a much larger scale in a
cluster setting.
+
+We are inspired by other open source efforts to provide ML systems, in
particular
+[scikit-learn](http://scikit-learn.org/) for cleanly specifying ML
pipelines, and Spark's
+[MLLib](https://spark.apache.org/mllib/) for providing ML algorithms that
scale with problem and
+cluster sizes.
+
+We already have some of the building blocks for FlinkML in place, and will
continue to extend the
+library with more algorithms. An example of how simple it is to create a
learning model in
+FlinkML is given below:
+
+{% highlight scala %}
+// LabeledVector is a feature vector with a label (class or real value)
+val data: DataSet[LabeledVector] = ...
+
+val learner = MultipleLinearRegression()
+
+val parameters = ParameterMap()
+ .add(MultipleLinearRegression.Stepsize, 1.0)
+ .add(MultipleLinearRegression.Iterations, 10)
+ .add(MultipleLinearRegression.ConvergenceThreshold, 0.001)
+
+val model = learner.fit(data, parameters)
+{% endhighlight %}
+
+The roadmap below can provide an indication of the algorithms we aim to
implement in the coming
+months. Items in **bold** have already been implemented:
+
+
+* Pipelines of transformers and learners
+* Data pre-processing
+ * **Feature scaling**
+ * **Polynomial feature base mapper**
+ * Feature hashing
+ * Feature extraction for text
+ * Dimensionality reduction
+* Model selection and performance evaluation
+ * Cross-validation for model selection and evaluation
+* Supervised learning
+ * Optimization framework
+ * **Stochastic Gradient Descent**
+ * L-BFGS
+ * Generalized Linear Models
+ * **Multiple linear regression**
+ * LASSO, Ridge regression
+ * Multi-class Logistic regression
+ * Random forests
+ * **Support Vector Machines**
+* Unsupervised learning
+ * Clustering
+ * K-means clustering
+ * PCA
+* Recommendation
+ * **ALS**
+* Text analytics
+ * LDA
+* Statistical estimation tools
+* Distributed linear algebra
+* Streaming ML
--- End diff --
I like the idea of a roadmap so that people can see where we're steering. But I
would probably put it on a separate page, where we can also elaborate a little
on what distributed linear algebra means and how it integrates.