[ https://issues.apache.org/jira/browse/SPARK-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng updated SPARK-2272:
---------------------------------
Assignee: DB Tsai
> Feature scaling which standardizes the range of independent variables or
> features of data.
> ------------------------------------------------------------------------------------------
>
> Key: SPARK-2272
> URL: https://issues.apache.org/jira/browse/SPARK-2272
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: DB Tsai
> Assignee: DB Tsai
>
> Feature scaling is a method used to standardize the range of independent
> variables or features of data. In data processing, it is also known as data
> normalization and is generally performed during the data preprocessing step.
> In this work, a trait called `VectorTransformer` is defined for generic
> transformation of a vector. It contains two methods: `apply`, which applies
> the transformation to a vector, and `unapply`, which applies the inverse
> transformation to a vector.
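
The trait described above might look roughly like the following. This is only a minimal sketch, not the code from the patch: `Array[Double]` stands in for MLlib's vector type so the example runs without Spark, and `VectorShifter` is a hypothetical toy implementation added purely to show the `apply`/`unapply` round trip.

```scala
// Hypothetical sketch: Array[Double] stands in for MLlib's Vector type.
trait VectorTransformer {
  /** Applies the transformation to a vector. */
  def apply(v: Array[Double]): Array[Double]

  /** Applies the inverse transformation to a vector. */
  def unapply(v: Array[Double]): Array[Double]
}

// Toy implementation (not part of the issue): shifts every element by a constant.
class VectorShifter(offset: Double) extends VectorTransformer {
  def apply(v: Array[Double]): Array[Double] = v.map(_ + offset)
  def unapply(v: Array[Double]): Array[Double] = v.map(_ - offset)
}

object TransformerDemo {
  def main(args: Array[String]): Unit = {
    val shift = new VectorShifter(2.0)
    val x = Array(1.0, 2.0, 3.0)
    val y = shift(x)            // shorthand for shift.apply(x)
    val back = shift.unapply(y) // inverse transformation recovers x
    println(y.mkString(", "))
    println(back.mkString(", "))
  }
}
```

Note that in Scala an `unapply` with this signature is an ordinary method; it only acts as an extractor when used in pattern matching.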
> There are three concrete implementations of `VectorTransformer`, all of
> which can easily be extended with PMML transformation support.
> 1) `VectorStandardizer` - Standardizes a vector given the mean and variance.
> Since standardization densifies the output, the output is always in dense
> vector format.
>
> 2) `VectorRescaler` - Rescales a vector into a target range specified by a
> tuple of two double values or two vectors as the new target minimum and
> maximum. Since the rescaling subtracts the minimum of each column first, the
> output will always be a dense vector regardless of the input vector type.
> 3) `VectorDivider` - Transforms a vector by dividing it by a constant or by
> another vector on an element-by-element basis. This transformation preserves
> the type of the input vector without densifying the result.
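
The three transformers listed above could be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `Array[Double]` again stands in for MLlib's vector type (so the dense/sparse distinction is not modelled), the constructor signatures are hypothetical, only the scalar-tuple form of the rescaling target is shown, and the handling of zero-variance or constant columns is an arbitrary choice made here.

```scala
// Hypothetical sketch; Array[Double] stands in for MLlib's Vector.
trait VectorTransformer {
  def apply(v: Array[Double]): Array[Double]
  def unapply(v: Array[Double]): Array[Double]
}

// (x - mean) / sqrt(variance); zero-variance columns map to 0.0 (assumption).
class VectorStandardizer(mean: Array[Double], variance: Array[Double])
    extends VectorTransformer {
  private val std = variance.map(math.sqrt)
  def apply(v: Array[Double]): Array[Double] =
    Array.tabulate(v.length) { i =>
      if (std(i) == 0.0) 0.0 else (v(i) - mean(i)) / std(i)
    }
  def unapply(v: Array[Double]): Array[Double] =
    Array.tabulate(v.length)(i => v(i) * std(i) + mean(i))
}

// Maps [min, max] of each column onto the target range (lo, hi).
class VectorRescaler(min: Array[Double], max: Array[Double], target: (Double, Double))
    extends VectorTransformer {
  private val (lo, hi) = target
  require(hi > lo, "target maximum must exceed target minimum")
  def apply(v: Array[Double]): Array[Double] =
    Array.tabulate(v.length) { i =>
      val range = max(i) - min(i)
      // Constant columns map to the target minimum (assumption).
      if (range == 0.0) lo else (v(i) - min(i)) / range * (hi - lo) + lo
    }
  def unapply(v: Array[Double]): Array[Double] =
    Array.tabulate(v.length)(i => (v(i) - lo) / (hi - lo) * (max(i) - min(i)) + min(i))
}

// Divides element i by divisor(i); the inverse multiplies.
class VectorDivider(divisor: Int => Double) extends VectorTransformer {
  def apply(v: Array[Double]): Array[Double] =
    Array.tabulate(v.length)(i => v(i) / divisor(i))
  def unapply(v: Array[Double]): Array[Double] =
    Array.tabulate(v.length)(i => v(i) * divisor(i))
}

object VectorDivider {
  def byConstant(c: Double): VectorDivider = new VectorDivider(_ => c)
  def byVector(d: Array[Double]): VectorDivider = new VectorDivider(i => d(i))
}

object ScalingDemo {
  def main(args: Array[String]): Unit = {
    val x = Array(2.0, 10.0)
    val standardizer = new VectorStandardizer(Array(1.0, 4.0), Array(4.0, 9.0))
    val rescaler = new VectorRescaler(Array(0.0, 0.0), Array(4.0, 20.0), (0.0, 1.0))
    val divider = VectorDivider.byConstant(2.0)
    println(standardizer(x).mkString(", "))
    println(rescaler(x).mkString(", "))
    println(divider(x).mkString(", "))
  }
}
```

The sketch also shows why only the divider can keep a sparse input sparse: dividing by a non-zero finite value leaves zeros at zero, whereas subtracting a mean or a minimum generally turns them into non-zero entries.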
> Utility helper methods are implemented for dividing, rescaling,
> normalization, and standardization. Each takes an RDD[Vector] as input and
> returns both the transformed RDD[Vector] and the transformer that was used.
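
A helper of the kind described above could be sketched like this for the standardization case. The names `VectorTransformers` and `standardize` are hypothetical, `Seq[Array[Double]]` stands in for `RDD[Vector]` so the example runs without Spark, and the use of the population variance (dividing by n) is an assumption rather than something the issue specifies.

```scala
// Hypothetical sketch; Seq[Array[Double]] stands in for RDD[Vector].
trait VectorTransformer {
  def apply(v: Array[Double]): Array[Double]
  def unapply(v: Array[Double]): Array[Double]
}

class VectorStandardizer(mean: Array[Double], variance: Array[Double])
    extends VectorTransformer {
  private val std = variance.map(math.sqrt)
  def apply(v: Array[Double]): Array[Double] =
    Array.tabulate(v.length) { i =>
      if (std(i) == 0.0) 0.0 else (v(i) - mean(i)) / std(i)
    }
  def unapply(v: Array[Double]): Array[Double] =
    Array.tabulate(v.length)(i => v(i) * std(i) + mean(i))
}

object VectorTransformers {
  /** Computes column statistics, then returns the transformed data
    * together with the transformer built from those statistics. */
  def standardize(data: Seq[Array[Double]]): (Seq[Array[Double]], VectorTransformer) = {
    require(data.nonEmpty, "data must be non-empty")
    val n = data.length.toDouble
    val dim = data.head.length
    val mean = Array.tabulate(dim)(j => data.map(row => row(j)).sum / n)
    // Population variance (divide by n); an assumption of this sketch.
    val variance = Array.tabulate(dim) { j =>
      data.map(row => math.pow(row(j) - mean(j), 2)).sum / n
    }
    val transformer = new VectorStandardizer(mean, variance)
    (data.map(v => transformer(v)), transformer)
  }
}

object HelperDemo {
  def main(args: Array[String]): Unit = {
    val data = Seq(Array(1.0, 10.0), Array(3.0, 30.0))
    val (scaled, transformer) = VectorTransformers.standardize(data)
    scaled.foreach(v => println(v.mkString(", ")))
    // The returned transformer can map results back to the original scale.
    println(transformer.unapply(scaled.head).mkString(", "))
  }
}
```

Returning the transformer alongside the data means the same statistics can later be applied to new vectors, or inverted to recover values on the original scale.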
--
This message was sent by Atlassian JIRA
(v6.2#6252)