[jira] [Commented] (FLINK-2050) Add pipelining mechanism for chainable transformers and estimators

ASF GitHub Bot (JIRA) Thu, 21 May 2015 01:44:19 -0700

    [ https://issues.apache.org/jira/browse/FLINK-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553896#comment-14553896 ]


ASF GitHub Bot commented on FLINK-2050:
---------------------------------------

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/704#discussion_r30783065
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/StandardScaler.scala ---
    @@ -22,38 +22,47 @@ import breeze.linalg
     import breeze.numerics.sqrt
     import breeze.numerics.sqrt._
     import org.apache.flink.api.common.functions._
    +import org.apache.flink.api.common.typeinfo.TypeInformation
     import org.apache.flink.api.scala._
     import org.apache.flink.configuration.Configuration
    -import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer}
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
     import org.apache.flink.ml.math.Breeze._
    -import org.apache.flink.ml.math.Vector
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{TransformOperation, FitOperation, Transformer}
     import org.apache.flink.ml.preprocessing.StandardScaler.{Mean, Std}
     
    +import scala.reflect.ClassTag
    +
     /** Scales observations, so that all features have a user-specified mean and standard deviation.
       * By default for [[StandardScaler]] transformer mean=0.0 and std=1.0.
       *
    -  * This transformer takes a [[Vector]] of values and maps it to a
    -  * scaled [[Vector]] such that each feature has a user-specified mean and standard deviation.
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature has a user-specified mean and standard
    +  * deviation.
       *
       * This transformer can be prepended to all [[Transformer]] and
    -  * [[org.apache.flink.ml.common.Learner]] implementations which expect an input of
    -  * [[Vector]].
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
       *
       * @example
       *          {{{
       *            val trainingDS: DataSet[Vector] = env.fromCollection(data)
       *            val transformer = StandardScaler().setMean(10.0).setStd(2.0)
       *
    -  *            transformer.transform(trainingDS)
    +  *            transformer.fit(trainingDS)
    +  *            val transformedDS = transformer.transform(trainingDS)
       *          }}}
       *
       * =Parameters=
       *
    -  * - [[StandardScaler.Mean]]: The mean value of transformed data set; by default equal to 0
    -  * - [[StandardScaler.Std]]: The standard deviation of the transformed data set; by default
    +  * - [[Mean]]: The mean value of transformed data set; by default equal to 0
    +  * - [[Std]]: The standard deviation of the transformed data set; by default
       * equal to 1
    --- End diff --
    
    Why use just the top-level type here, but the fully qualified one in the ALS docstring?


> Add pipelining mechanism for chainable transformers and estimators
> ------------------------------------------------------------------
>
>                 Key: FLINK-2050
>                 URL: https://issues.apache.org/jira/browse/FLINK-2050
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>              Labels: ML
>             Fix For: 0.9
>
>
> The key concept of an easy-to-use ML library is the quick and simple
> construction of data analysis pipelines. Scikit-learn's approach of defining
> transformers and estimators seems to be a really good solution to this
> problem. I propose to follow a similar path, because it makes FlinkML
> flexible in terms of code reuse as well as easy to use for people coming
> from Scikit-learn.
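
To make the chaining idea from the issue description concrete, here is a minimal, dependency-free sketch in plain Scala. It is not FlinkML's actual API: the `Transformer` trait, `chainTransformer`, `Scaler`, and `Shift` below are hypothetical names chosen for illustration, and the sketch works on an in-memory `Seq[Double]` rather than a Flink `DataSet`. It assumes non-empty input.

```scala
// Hypothetical sketch of scikit-learn-style chaining; NOT the FlinkML API.
object PipelineSketch {

  /** A stage that learns state in `fit` and maps data in `transform`. */
  trait Transformer[A, B] { self =>
    def fit(data: Seq[A]): Unit
    def transform(data: Seq[A]): Seq[B]

    /** Chains this stage with `next`, yielding a new two-stage transformer. */
    def chainTransformer[C](next: Transformer[B, C]): Transformer[A, C] =
      new Transformer[A, C] {
        // Fit the first stage, then fit the second on the first stage's output.
        def fit(data: Seq[A]): Unit = {
          self.fit(data)
          next.fit(self.transform(data))
        }
        def transform(data: Seq[A]): Seq[C] =
          next.transform(self.transform(data))
      }
  }

  /** Rescales values to a target mean and standard deviation (assumes non-empty data). */
  final class Scaler(targetMean: Double, targetStd: Double)
      extends Transformer[Double, Double] {
    private var mean = 0.0
    private var std = 1.0

    def fit(data: Seq[Double]): Unit = {
      mean = data.sum / data.size
      val variance = data.map(x => (x - mean) * (x - mean)).sum / data.size
      val s = math.sqrt(variance)
      std = if (s == 0.0) 1.0 else s // avoid dividing by zero for constant data
    }

    def transform(data: Seq[Double]): Seq[Double] =
      data.map(x => (x - mean) / std * targetStd + targetMean)
  }

  /** A stateless stage that adds a constant offset. */
  final class Shift(offset: Double) extends Transformer[Double, Double] {
    def fit(data: Seq[Double]): Unit = () // nothing to learn
    def transform(data: Seq[Double]): Seq[Double] = data.map(_ + offset)
  }

  def main(args: Array[String]): Unit = {
    val pipeline = new Scaler(0.0, 1.0).chainTransformer(new Shift(10.0))
    val data = Seq(1.0, 2.0, 3.0)
    pipeline.fit(data)
    println(pipeline.transform(data))
  }
}
```

Because a chained pair is itself a `Transformer`, pipelines of any length can be built by repeated chaining. A final estimator stage could presumably be attached in the same way, though that is not shown here.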



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
