GitHub user sryza commented on a diff in the pull request:
https://github.com/apache/spark/pull/6126#discussion_r30651674
--- Diff: docs/ml-features.md ---
@@ -183,6 +183,101 @@ for words_label in wordsDataFrame.select("words", "label").take(3):
</div>
</div>
+## OneHotEncoder
+
+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of
label indices to a column of binary vectors, with at most a single one-value.
This encoding allows algorithms which expect continuous features, such as
Logistic Regression, to use categorical features as well. The
[OneHotEncoder](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder)
class provides this functionality. By default, the resulting binary vector has
a component for each category, so with 5 categories, an input value of 2.0
would map to an output vector of (0.0, 0.0, 1.0, 0.0, 0.0). If
`includeFirst` is set to false, the first category is omitted, so the output
vector for the previous example would be (0.0, 1.0, 0.0, 0.0) and an input
value of 0.0 would map to a vector of all zeros. Including the first category
makes the vector columns linearly dependent because they sum up to one.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
+
+val df = sqlContext.createDataFrame(Seq(
+ (0, "a"),
+ (1, "b"),
+ (2, "c"),
+ (3, "a"),
+ (4, "a"),
+ (5, "c")
+)).toDF("id", "category")
+
+val indexer = new StringIndexer()
+ .setInputCol("category")
+ .setOutputCol("categoryIndex")
+ .fit(df)
+val indexed = indexer.transform(df)
+
+val encoder = new OneHotEncoder().setInputCol("categoryIndex").
+ setOutputCol("categoryVec")
+val encoded = encoder.transform(indexed)
+encoded.select("id", "categoryVec").foreach(println)
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% highlight java %}
+import com.google.common.collect.Lists;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.OneHotEncoder;
+import org.apache.spark.ml.feature.StringIndexer;
+import org.apache.spark.ml.feature.StringIndexerModel;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+JavaRDD<Row> jrdd = jsc.parallelize(Lists.newArrayList(
+ RowFactory.create(0, "a"),
+ RowFactory.create(1, "b"),
+ RowFactory.create(2, "c"),
+ RowFactory.create(3, "a"),
+ RowFactory.create(4, "a"),
+ RowFactory.create(5, "c")
+));
+StructType schema = new StructType(new StructField[]{
+ new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
+ new StructField("category", DataTypes.StringType, false, Metadata.empty())
+});
+DataFrame df = sqlContext.createDataFrame(jrdd, schema);
+StringIndexerModel indexer = new StringIndexer()
+ .setInputCol("category")
+ .setOutputCol("categoryIndex")
+ .fit(df);
+DataFrame indexed = indexer.transform(df);
+
+OneHotEncoder encoder = new OneHotEncoder()
+ .setInputCol("categoryIndex")
+ .setOutputCol("categoryVec");
+DataFrame encoded = encoder.transform(indexed);
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+{% highlight python %}
+from pyspark.ml.feature import OneHotEncoder, StringIndexer
+
+df = sqlContext.createDataFrame([
+ (0, "a"),
+ (1, "b"),
+ (2, "c"),
+ (3, "a"),
+ (4, "a"),
+ (5, "c")
+], ["id", "category"])
+
+stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
+model = stringIndexer.fit(df)
+indexed = model.transform(df)
+encoder = OneHotEncoder(includeFirst=False, inputCol="categoryIndex", outputCol="categoryVec")
+encoded = encoder.transform(indexed)
+</div>
--- End diff --
Posting a patch that fixes this. My Jekyll efforts have been thwarted by
errors like:
`/home/sandy/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:318:
polymorphic expression cannot be instantiated to expected type;`
Any idea how to get past these?
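
For reference, the mapping described in the OneHotEncoder paragraph of the diff can be checked with a small plain-Python sketch. This is only an illustration of the documented behavior, not Spark's implementation: the helper name `one_hot` and its `num_categories` argument are hypothetical, and `include_first` merely mirrors the `includeFirst` parameter named in the diff.

```python
def one_hot(index, num_categories, include_first=True):
    """Encode a category index as a list of 0.0/1.0 values.

    Hypothetical helper for illustration only; `include_first` mirrors the
    `includeFirst` parameter described in the diff.
    """
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    # When the first category is dropped, the vector is one element shorter
    # and every remaining category shifts down by one position.
    offset = 0 if include_first else 1
    vec = [0.0] * (num_categories - offset)
    pos = int(index) - offset
    if pos >= 0:  # index 0 with include_first=False stays all zeros
        vec[pos] = 1.0
    return vec


print(one_hot(2.0, 5))                       # [0.0, 0.0, 1.0, 0.0, 0.0]
print(one_hot(2.0, 5, include_first=False))  # [0.0, 1.0, 0.0, 0.0]
print(one_hot(0.0, 5, include_first=False))  # [0.0, 0.0, 0.0, 0.0]
```

The three printed vectors match the examples given in the documentation paragraph, including the all-zeros vector for the omitted first category.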