Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/7211#discussion_r36770690
--- Diff: docs/ml-features.md ---
@@ -905,6 +906,74 @@ scaledData = scalerModel.transform(dataFrame)
</div>
</div>
+## MinMaxScaler
+
+`MinMaxScaler` transforms a dataset of `Vector` rows, rescaling each feature to a specific range (often [0, 1]). It takes parameters:
+
+* `min`: 0.0 by default. Lower bound after transformation, shared by all features.
+* `max`: 1.0 by default. Upper bound after transformation, shared by all features.
+
+`MinMaxScaler` computes summary statistics on a dataset and produces a `MinMaxScalerModel`. The model can then transform each feature individually such that it is in the given range.
+
+The rescaled value for a feature E is calculated as,
+`\begin{equation}
+  Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
+\end{equation}`
+
+For the case `E_{max} == E_{min}`, `Rescaled(e_i) = 0.5 * (max + min)`.
+
+Note that since zero values will probably be transformed to non-zero values, the output of the transformer will be a `DenseVector` even for sparse input.
+
+More details can be found in the API docs for
+[MinMaxScaler](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and
+[MinMaxScalerModel](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScalerModel).
+
+The following example demonstrates how to load a dataset in libsvm format and then rescale each feature to [0, 1].
+
+<div class="codetabs">
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.ml.feature.MinMaxScaler
+import org.apache.spark.mllib.util.MLUtils
+
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+val dataFrame = sqlContext.createDataFrame(data)
+val scaler = new MinMaxScaler()
+ .setInputCol("features")
+ .setOutputCol("scaledFeatures")
+
+// Compute summary statistics by fitting the MinMaxScaler.
+val scalerModel = scaler.fit(dataFrame)
+
+// Rescale each feature to range [min, max].
+val scaledData = scalerModel.transform(dataFrame)
+{% endhighlight %}
+</div>
+
+<div data-lang="java">
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.MinMaxScaler;
+import org.apache.spark.ml.feature.MinMaxScalerModel;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.util.MLUtils;
+import org.apache.spark.sql.DataFrame;
+
+JavaRDD<LabeledPoint> data =
+  MLUtils.loadLibSVMFile(jsc.sc(), "data/mllib/sample_libsvm_data.txt").toJavaRDD();
+DataFrame dataFrame = jsql.createDataFrame(data, LabeledPoint.class);
+MinMaxScaler scaler = new MinMaxScaler()
+ .setInputCol("features")
+ .setOutputCol("scaledFeatures");
+
+// Compute summary statistics by fitting the MinMaxScaler.
+MinMaxScalerModel scalerModel = scaler.fit(dataFrame);
+
+// Rescale each feature to range [min, max].
--- End diff --
Please update the inline comments.
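For reference, the rescaling formula in the patch can be sanity-checked with a quick standalone sketch. This is plain Python arithmetic, not Spark code, and the helper name `rescale` is made up purely for illustration:

```python
def rescale(e, e_min, e_max, lo=0.0, hi=1.0):
    """Rescale a single feature value e from [e_min, e_max] to [lo, hi].

    Hypothetical helper mirroring the MinMaxScaler formula in the docs.
    """
    if e_max == e_min:
        # Degenerate case from the docs: a constant feature maps to
        # the midpoint of the target range.
        return 0.5 * (hi + lo)
    return (e - e_min) / (e_max - e_min) * (hi - lo) + lo

feature = [2.0, 4.0, 6.0]
scaled = [rescale(e, min(feature), max(feature)) for e in feature]
print(scaled)  # [0.0, 0.5, 1.0]
```

With the default `min=0.0`/`max=1.0`, the smallest value maps to 0.0 and the largest to 1.0, matching what the `MinMaxScalerModel` would produce per feature.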