Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/6126#discussion_r30289178
--- Diff: docs/ml-features.md ---
@@ -183,6 +183,75 @@ for words_label in wordsDataFrame.select("words", "label").take(3):
</div>
</div>
+## OneHotEncoder
+
+One-hot encoding is a way of formatting categorical features as input into
machine learning algorithms. It maps a column of label indices to a column of
binary vectors, with at most a single one-value. The
[OneHotEncoder](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder)
class provides this functionality. By default, the resulting binary vector has
a component for each category, so with 5 categories, an input value of 2.0
would map to an output vector of (0.0, 0.0, 1.0, 0.0, 0.0). If includeFirst
is set to false, the first category is omitted, so the output vector for the
previous example would be (0.0, 1.0, 0.0, 0.0) and an input value of 0.0 would
map to a vector of all zeros. Including the first category makes the vector
columns linearly dependent because they sum up to one.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
+
+val df = sqlContext.createDataFrame(Seq(
+ (0, "a"),
+ (1, "b"),
+ (2, "c"),
+ (3, "a"),
+ (4, "a"),
+ (5, "c")
+)).toDF("id", "category")
+
+val indexer = new StringIndexer().setInputCol("category").
--- End diff --
Just spoke with Xiangrui. He voted against ending lines with dots, as on the
line above, since that style is unusual; users can use paste mode (":paste")
in the Scala shell instead. Perhaps that's the historical precedent? It could
be worth discussing on the dev list.
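
For reference, the two chaining styles under discussion can be sketched as
follows. This is only an illustration, not code from the PR: Indexer is a
hypothetical stand-in for a Spark ML stage, and the setOutputCol call and the
"categoryIndex" column name are assumptions, since the diff is cut off after
setInputCol.

```scala
object ChainingStyles {
  // Hypothetical stand-in for a Spark ML stage with chainable setters.
  class Indexer {
    private var inputCol = ""
    private var outputCol = ""
    def setInputCol(value: String): this.type = { inputCol = value; this }
    def setOutputCol(value: String): this.type = { outputCol = value; this }
    def describe: String = s"$inputCol -> $outputCol"
  }

  // Style used in the diff: each line ends with a dot, so the expression
  // is visibly incomplete until the last line.
  val trailing = new Indexer().
    setInputCol("category").
    setOutputCol("categoryIndex") // assumed column name

  // More common style: each continuation line starts with a dot.
  val leading = new Indexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex") // assumed column name

  def main(args: Array[String]): Unit = {
    println(trailing.describe)
    println(leading.describe)
  }
}
```

Both forms build the same object when compiled. The difference only matters
when the snippet is typed into the Scala shell line by line, where the
leading-dot form may need paste mode, as noted above.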