[ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576999#comment-14576999
 ] 

ASF GitHub Bot commented on FLINK-2072:
---------------------------------------

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31902243
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    --- End diff --
    
    My thought was that we would provide the whole thing as an example program somewhere in the Flink examples. But I can add the needed imports here as well, section by section.


> Add a quickstart guide for FlinkML
> ----------------------------------
>
>                 Key: FLINK-2072
>                 URL: https://issues.apache.org/jira/browse/FLINK-2072
>             Project: Flink
>          Issue Type: New Feature
>          Components: Documentation, Machine Learning Library
>            Reporter: Theodore Vasiloudis
>            Assignee: Theodore Vasiloudis
>             Fix For: 0.9
>
>
> We need a quickstart guide that introduces users to the core concepts of 
> FlinkML to get them up and running quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
