[jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML

ASF GitHub Bot (JIRA) Thu, 11 Jun 2015 01:04:42 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581640#comment-14581640
 ]


ASF GitHub Bot commented on FLINK-2072:
---------------------------------------

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32197679
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward 
process, abstracting away
    +the complexities that usually come with having to deal with big data 
learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple 
supervised learning problem
    +using FlinkML. But first, some basics; feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [1], ML deals with detecting patterns in data and using those
    +learned patterns to make predictions about the future. We can categorize 
most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* **Supervised Learning** deals with learning a function (mapping) from a 
set of inputs
    +(features) to a set of outputs. The learning is done using a *training 
set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised 
learning problems are
    +further divided into classification and regression problems. In 
classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user 
is going to click on
    +an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
    +values, often called the dependent variable, for example what the 
temperature will be tomorrow.
    +
    +* **Unsupervised Learning** deals with discovering patterns and 
regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the 
data from the
    +descriptive features. Unsupervised learning can also be used for dimensionality reduction, for example
    +through [principal components 
analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Linking with FlinkML
    +
    +In order to use FlinkML in your project, first you have to
    +[set up a Flink 
program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your 
project:
    +
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +
    +## Loading data
    +
    +To load data to be used with FlinkML we can use the ETL capabilities of 
Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised 
learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, 
label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of 
the example and a `Double`
    +member which represents the label, which could be the class in a 
classification problem, or the dependent
    +variable for a regression problem.
    +
    +As an example, we can use Haberman's Survival Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data).
    +This dataset *"contains cases from study conducted on the survival of 
patients who had undergone
    +surgery for breast cancer"*. The data comes in a comma-separated file, 
where the first 3 columns
    +are the features and the last (4th) column is the class, which indicates whether the patient
    +survived 5 years or longer (label 1), or died within 5 years (label 2). 
You can check the [UCI
    +page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for 
more information on the data.
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.api.scala.ExecutionEnvironment
    +
    +val env = ExecutionEnvironment.createLocalEnvironment(2)
    --- End diff --
    
    Good idea, I will use that instead.
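
    To illustrate the `(features, label)` layout described in the quoted
    section, here is a minimal plain-Scala sketch that parses one
    comma-separated line into three features and a label. It deliberately
    avoids any Flink dependency: `LabeledExample` and `HabermanParsing` are
    hypothetical stand-ins invented for this example, not FlinkML classes,
    and the sample line in the usage note is illustrative only.

```scala
// Hypothetical stand-in for a labeled example; NOT the FlinkML LabeledVector class.
final case class LabeledExample(label: Double, features: Vector[Double])

object HabermanParsing {
  // Parse one comma-separated line: three feature columns followed by the class label.
  def parseLine(line: String): LabeledExample = {
    val fields = line.split(",").map(_.trim.toDouble)
    require(fields.length == 4, s"expected 4 columns, got ${fields.length}")
    // The last column is the label (1 or 2); the first three are the features.
    LabeledExample(label = fields.last, features = fields.init.toVector)
  }
}
```

    For example, `HabermanParsing.parseLine("30,64,1,1")` yields a
    `LabeledExample` with label `1.0` and features `Vector(30.0, 64.0, 1.0)`.
    In an actual Flink job the same function could presumably be applied in a
    `map` over the `DataSet[String]`, but that wiring is not shown here.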


> Add a quickstart guide for FlinkML
> ----------------------------------
>
>                 Key: FLINK-2072
>                 URL: https://issues.apache.org/jira/browse/FLINK-2072
>             Project: Flink
>          Issue Type: New Feature
>          Components: Documentation, Machine Learning Library
>            Reporter: Theodore Vasiloudis
>            Assignee: Theodore Vasiloudis
>             Fix For: 0.9
>
>
> We need a quickstart guide that introduces users to the core concepts of 
> FlinkML to get them up and running quickly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
