[
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576771#comment-14576771
]
ASF GitHub Bot commented on FLINK-2072:
---------------------------------------
Github user tillrohrmann commented on a diff in the pull request:
https://github.com/apache/flink/pull/792#discussion_r31896684
--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
* This will be replaced by the TOC
{:toc}
-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward
process, abstracting away
+the complexities that usually come with having to deal with big data
learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple
supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines
if you're already
+familiar with Machine Learning (ML)
+
+As defined by Murphy [cite ML-APP] ML deals with detecting patterns in
data, and using those
+learned patterns to make predictions about the future. We can categorize
most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* Supervised Learning deals with learning a function (mapping) from a set
of inputs
+(predictors) to a set of outputs. The learning is done using a __training
set__ of (input,
+output) pairs that we use to approximate the mapping function. Supervised
learning problems are
+further divided into classification and regression problems. In
classification problems we try to
+predict the __class__ that an example belongs to, for example whether a
user is going to click on
+an ad or not. Regression problems are about predicting (real) numerical
values, often called the dependent
+variable, for example what the temperature will be tomorrow.
+
+* Unsupervised learning deals with discovering patterns and regularities
in the data. An example
+of this would be __clustering__, where we try to discover groupings of the
data from the
+descriptive features. Unsupervised learning can also be used for feature
selection, for example
+through [principal components
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Loading data
+
+For loading data to be used with FlinkML we can use the ETL capabilities
of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised
learning problems it is
+common to use the `LabeledVector` class to represent the `(features,
label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of
the example and a `Double`
+member which represents the label, which could be the class in a
classification problem, or the dependent
+variable for a regression problem.
+
+# TODO: Get dataset that has separate train and test sets
+As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data
Set, which you can
+[download from the UCI ML
repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
+
+We can load the data as a `DataSet[String]` first:
+
+{% highlight scala %}
+
+val cancer = env.readCsvFile[(String, String, String, String, String,
String, String, String, String, String,
String)]("/path/to/breast-cancer-wisconsin.data")
+
+{% endhighlight %}
+
+The dataset has some missing values indicated by `?`. We can filter those
rows out and
+then transform the data into a `DataSet[LabeledVector]`. This will allow
us to use the
+dataset with the FlinkML classification algorithms.
+
+{% highlight scala %}
+
+val cancerLV = cancer
+ .map(_.productIterator.toList)
+ .filter(!_.contains("?"))
+ .map{list =>
+ val numList = list.map(_.asInstanceOf[String].toDouble)
+ LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
+ }
+
+{% endhighlight %}
+
+We can then use this data to train a learner.
+
+A common format for ML datasets is the LibSVM format and a number of
datasets using that format can be
+found [in the LibSVM datasets
website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML
provides utilities for loading
+datasets using the LibSVM format through the `readLibSVM` function
available through the MLUtils object.
+You can also save datasets in the LibSVM format using the `writeLibSVM`
function.
+Let's import the Adult (a9a) dataset. You can download the
+[training set
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
+and the [test set
here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
+
+We can simply import the dataset then using:
+
+{% highlight scala %}
+
+val adultTrain = MLUtils.readLibSVM("path/to/a8a")
+val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
+
+{% endhighlight %}
+
+This gives us a `DataSet[LabeledVector]` that we will use in the following
section to create a classifier.
+
+Due to an error in the test dataset we have to adjust the test data using
the following code, to
+ensure that the dimensionality of all test examples is 123, as with the
training set:
+
+{% highlight scala %}
+
+val adjustedTest = adultTest.map{lv =>
+ val vec = lv.vector.asBreeze
+ val padded = vec.padTo(123, 0.0).toDenseVector
+ val fvec = padded.fromBreeze
+ LabeledVector(lv.label, fvec)
+ }
+
+{% endhighlight %}
+
+## Classification
+
+Once we have imported the dataset we can train a `Predictor` such as a
linear SVM classifier.
+
+{% highlight scala %}
+
+val svm = SVM()
+ .setBlocks(env.getParallelism)
+ .setIterations(100)
+ .setRegularization(0.001)
+ .setStepsize(0.1)
+ .setSeed(42)
+
+svm.fit(adultTrain)
+
+{% endhighlight %}
+
+Let's now make predictions on the test set and see how well we do in terms
of absolute error
+We will also create a function that thresholds the predictions to the {-1,
1} scale that the
+dataset uses.
+
+{% highlight scala %}
+
+def thresholdPredictions(predictions: DataSet[(Double, Double)])
+: DataSet[(Double, Double)] = {
+ predictions.map {
+ truthPrediction =>
+ val truth = truthPrediction._1
+ val prediction = truthPrediction._2
+ val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
+ (truth, thresholdedPrediction)
+ }
+}
+
+val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
+
+val absoluteErrorSum = predictionPairs.collect().map{
+ case (truth, prediction) => Math.abs(truth - prediction)}.sum
--- End diff --
Shouldn't we divide the error by two? Otherwise each error will add `2` to
the absolute error sum.
> Add a quickstart guide for FlinkML
> ----------------------------------
>
> Key: FLINK-2072
> URL: https://issues.apache.org/jira/browse/FLINK-2072
> Project: Flink
> Issue Type: New Feature
> Components: Documentation, Machine Learning Library
> Reporter: Theodore Vasiloudis
> Assignee: Theodore Vasiloudis
> Fix For: 0.9
>
>
> We need a quickstart guide that introduces users to the core concepts of
> FlinkML to get them up and running quickly.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)