Repository: spark
Updated Branches:
  refs/heads/branch-1.4 31e640499 -> e847d8621


[SPARK-7556] [ML] [DOC] Add user guide for spark.ml Binarizer, including Scala, 
Java and Python examples

JIRA: https://issues.apache.org/jira/browse/SPARK-7556

Author: Liang-Chi Hsieh <[email protected]>

Closes #6116 from viirya/binarizer_doc and squashes the following commits:

40cb677 [Liang-Chi Hsieh] Better print out.
5b7ef1d [Liang-Chi Hsieh] Make examples more clear.
1bf9c09 [Liang-Chi Hsieh] For comments.
6cf8cba [Liang-Chi Hsieh] Add user guide for Binarizer.

(cherry picked from commit c8696337e2a5878f3171eb574c0a1365d45814c9)
Signed-off-by: Joseph K. Bradley <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e847d862
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e847d862
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e847d862

Branch: refs/heads/branch-1.4
Commit: e847d86215f950f21195f7dec42f2e7d5b11a364
Parents: 31e6404
Author: Liang-Chi Hsieh <[email protected]>
Authored: Fri May 15 15:05:04 2015 -0700
Committer: Joseph K. Bradley <[email protected]>
Committed: Fri May 15 15:05:13 2015 -0700

----------------------------------------------------------------------
 docs/ml-features.md | 84 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/e847d862/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 0cbebcb..5df61dd 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -183,6 +183,90 @@ for words_label in wordsDataFrame.select("words", 
"label").take(3):
 </div>
 </div>
 
+## Binarizer
+
+Binarization is the process of thresholding numerical features to binary 
features. As some probabilistic estimators make assumption that the input data 
is distributed according to [Bernoulli 
distribution](http://en.wikipedia.org/wiki/Bernoulli_distribution), a binarizer 
is useful for pre-processing the input data with continuous numerical features.
+
+A simple 
[Binarizer](api/scala/index.html#org.apache.spark.ml.feature.Binarizer) class 
provides this functionality. Besides the common parameters of `inputCol` and 
`outputCol`, `Binarizer` has the parameter `threshold` used for binarizing 
continuous numerical features. The features greater than the threshold, will be 
binarized to 1.0. The features equal to or less than the threshold, will be 
binarized to 0.0. The example below shows how to binarize numerical features.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+import org.apache.spark.ml.feature.Binarizer
+import org.apache.spark.sql.DataFrame
+
+val data = Array(
+  (0, 0.1),
+  (1, 0.8),
+  (2, 0.2)
+)
+val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", 
"feature")
+
+val binarizer: Binarizer = new Binarizer()
+  .setInputCol("feature")
+  .setOutputCol("binarized_feature")
+  .setThreshold(0.5)
+
+val binarizedDataFrame = binarizer.transform(dataFrame)
+val binarizedFeatures = binarizedDataFrame.select("binarized_feature")
+binarizedFeatures.collect().foreach(println)
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% highlight java %}
+import com.google.common.collect.Lists;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.Binarizer;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+JavaRDD<Row> jrdd = jsc.parallelize(Lists.newArrayList(
+  RowFactory.create(0, 0.1),
+  RowFactory.create(1, 0.8),
+  RowFactory.create(2, 0.2)
+));
+StructType schema = new StructType(new StructField[]{
+  new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
+  new StructField("feature", DataTypes.DoubleType, false, Metadata.empty())
+});
+DataFrame continuousDataFrame = jsql.createDataFrame(jrdd, schema);
+Binarizer binarizer = new Binarizer()
+  .setInputCol("feature")
+  .setOutputCol("binarized_feature")
+  .setThreshold(0.5);
+DataFrame binarizedDataFrame = binarizer.transform(continuousDataFrame);
+DataFrame binarizedFeatures = binarizedDataFrame.select("binarized_feature");
+for (Row r : binarizedFeatures.collect()) {
+  Double binarized_value = r.getDouble(0);
+  System.out.println(binarized_value);
+}
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+{% highlight python %}
+from pyspark.ml.feature import Binarizer
+
+continuousDataFrame = sqlContext.createDataFrame([
+  (0, 0.1),
+  (1, 0.8),
+  (2, 0.2)
+], ["label", "feature"])
+binarizer = Binarizer(threshold=0.5, inputCol="feature", 
outputCol="binarized_feature")
+binarizedDataFrame = binarizer.transform(continuousDataFrame)
+binarizedFeatures = binarizedDataFrame.select("binarized_feature")
+for binarized_feature, in binarizedFeatures.collect():
+  print binarized_feature
+{% endhighlight %}
+</div>
+</div>
 
 # Feature Selectors
 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to