[ https://issues.apache.org/jira/browse/FLINK-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553896#comment-14553896 ]

ASF GitHub Bot commented on FLINK-2050:
---------------------------------------

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/704#discussion_r30783065

    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/StandardScaler.scala ---
    @@ -22,38 +22,47 @@ import breeze.linalg
     import breeze.numerics.sqrt
     import breeze.numerics.sqrt._
     import org.apache.flink.api.common.functions._
    +import org.apache.flink.api.common.typeinfo.TypeInformation
     import org.apache.flink.api.scala._
     import org.apache.flink.configuration.Configuration
    -import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer}
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
     import org.apache.flink.ml.math.Breeze._
    -import org.apache.flink.ml.math.Vector
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{TransformOperation, FitOperation, Transformer}
     import org.apache.flink.ml.preprocessing.StandardScaler.{Mean, Std}
     
    +import scala.reflect.ClassTag
    +
     /** Scales observations, so that all features have a user-specified mean and standard deviation.
       * By default for [[StandardScaler]] transformer mean=0.0 and std=1.0.
       *
    -  * This transformer takes a [[Vector]] of values and maps it to a
    -  * scaled [[Vector]] such that each feature has a user-specified mean and standard deviation.
    +  * This transformer takes a subtype of [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature has a user-specified mean and standard
    +  * deviation.
       *
       * This transformer can be prepended to all [[Transformer]] and
    -  * [[org.apache.flink.ml.common.Learner]] implementations which expect an input of
    -  * [[Vector]].
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
       *
       * @example
       *          {{{
       *            val trainingDS: DataSet[Vector] = env.fromCollection(data)
       *            val transformer = StandardScaler().setMean(10.0).setStd(2.0)
       *
    -  *            transformer.transform(trainingDS)
    +  *            transformer.fit(trainingDS)
    +  *            val transformedDS = transformer.transform(trainingDS)
       *          }}}
       *
       * =Parameters=
       *
    -  * - [[StandardScaler.Mean]]: The mean value of transformed data set; by default equal to 0
    -  * - [[StandardScaler.Std]]: The standard deviation of the transformed data set; by default
    +  * - [[Mean]]: The mean value of transformed data set; by default equal to 0
    +  * - [[Std]]: The standard deviation of the transformed data set; by default
       *   equal to 1
    --- End diff --

    Why use just the top-level type here, but the fully qualified one in the ALS docstring?


> Add pipelining mechanism for chainable transformers and estimators
> ------------------------------------------------------------------
>
>                 Key: FLINK-2050
>                 URL: https://issues.apache.org/jira/browse/FLINK-2050
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>              Labels: ML
>             Fix For: 0.9
>
>
> The key concept of an easy to use ML library is the quick and simple
> construction of data analysis pipelines. Scikit-learn's approach to define
> transformers and estimators seems to be a really good solution to this
> problem. I propose to follow a similar path, because it makes FlinkML
> flexible in terms of code reuse as well as easy for people coming from
> Scikit-learn to use the FlinkML.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)