Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/6116#discussion_r30276091
--- Diff: docs/ml-features.md ---
@@ -183,6 +183,83 @@ for words_label in wordsDataFrame.select("words",
"label").take(3):
</div>
</div>
+## Binarizer
+
+Binarization is the process of thresholding numerical features to binary features. Since some probabilistic estimators assume that the input data follows a [Bernoulli distribution](http://en.wikipedia.org/wiki/Bernoulli_distribution), a binarizer is useful for pre-processing input data with continuous numerical features.
+
+The [Binarizer](api/scala/index.html#org.apache.spark.ml.feature.Binarizer) class provides this functionality. Besides the common `inputCol` and `outputCol` parameters, `Binarizer` has a `threshold` parameter used for binarizing continuous numerical features: feature values greater than the threshold are binarized to 1.0, and values equal to or less than the threshold are binarized to 0.0. The example below shows how to binarize numerical features.
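Independently of the Spark API, the per-value rule can be sketched in a few lines of plain Java; `binarize` below is a hypothetical helper for illustration, not part of Spark:

```java
// Minimal sketch (not the Spark implementation) of the rule Binarizer applies
// to each feature value: values strictly greater than the threshold become
// 1.0, values equal to or below it become 0.0.
public class BinarizeSketch {
    // Hypothetical helper illustrating the threshold rule.
    static double binarize(double value, double threshold) {
        return value > threshold ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        double threshold = 0.5;
        double[] features = {0.1, 0.8, 0.2};
        for (double f : features) {
            System.out.println(f + " -> " + binarize(f, threshold));
        }
    }
}
```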
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+import org.apache.spark.ml.feature.Binarizer
+import org.apache.spark.sql.DataFrame
+
+val data = Array(
+ (0, 0.1),
+ (1, 0.8),
+ (2, 0.2)
+)
+val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", "feature")
+
+val binarizer: Binarizer = new Binarizer()
+ .setInputCol("feature")
+ .setOutputCol("binarized_feature")
+ .setThreshold(0.5)
+
+binarizer.transform(dataFrame).select("binarized_feature").collect().foreach(println)
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% highlight java %}
+import com.google.common.collect.Lists;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.Binarizer;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+JavaRDD<Row> jrdd = jsc.parallelize(Lists.newArrayList(
+ RowFactory.create(0, 0.1),
+ RowFactory.create(1, 0.8),
+ RowFactory.create(2, 0.2)
+));
+StructType schema = new StructType(new StructField[]{
+  new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
+ new StructField("feature", DataTypes.DoubleType, false, Metadata.empty())
+});
+DataFrame continuousDataFrame = jsql.createDataFrame(jrdd, schema);
+Binarizer binarizer = new Binarizer().setInputCol("feature")
--- End diff --
Please put one setter call per line (as you did in Scala).