GitHub user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6126#discussion_r30287964
  
    --- Diff: docs/ml-features.md ---
    @@ -183,6 +183,75 @@ for words_label in wordsDataFrame.select("words", "label").take(3):
     </div>
     </div>
     
    +## OneHotEncoder
    +
    +One-hot encoding is a way of formatting categorical features as input into machine learning algorithms. It maps a column of label indices to a column of binary vectors, with at most a single one-value. The [OneHotEncoder](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder) class provides this functionality. By default, the resulting binary vector has a component for each category, so with 5 categories, an input value of 2.0 would map to an output vector of (0.0, 0.0, 1.0, 0.0, 0.0). If the includeFirst is set to false, the first category is omitted, so the output vector for the previous example would be (0.0, 1.0, 0.0, 0.0) and an input value of 0.0 would map to a vector of all zeros. Including the first category makes the vector columns linearly dependent because they sum up to one.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
    +
    +val df = sqlContext.createDataFrame(Seq(
    +  (0, "a"),
    +  (1, "b"),
    +  (2, "c"),
    +  (3, "a"),
    +  (4, "a"),
    +  (5, "c")
    +)).toDF("id", "category")
    +
    +val indexer = new StringIndexer().setInputCol("category").
    --- End diff --
    
    That's a good point about dots on the line above.  I'm not sure, but I'll 
ask & let you know.
    
    I'll buy that about StringIndexer, so let's keep it.
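
    For readers of this thread, the includeFirst behavior described in the doc text above can be sketched in plain Scala without Spark. The `oneHot` helper below is hypothetical and is not part of the Spark API or of this PR; it only mirrors the mapping the paragraph describes, assuming category indices run from 0 to numCategories - 1.

```scala
// Hypothetical helper for illustration only; not Spark's OneHotEncoder.
// Maps a category index to a binary vector, optionally dropping the
// component for the first category (index 0).
def oneHot(index: Int, numCategories: Int, includeFirst: Boolean = true): Vector[Double] = {
  require(index >= 0 && index < numCategories, "index out of range")
  // Full encoding: a single 1.0 at position `index`, 0.0 elsewhere.
  val full = Vector.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)
  // Dropping the first component leaves index 0 encoded as all zeros.
  if (includeFirst) full else full.tail
}

val withFirst    = oneHot(2, 5)                       // (0.0, 0.0, 1.0, 0.0, 0.0)
val withoutFirst = oneHot(2, 5, includeFirst = false) // (0.0, 1.0, 0.0, 0.0)
val firstDropped = oneHot(0, 5, includeFirst = false) // (0.0, 0.0, 0.0, 0.0)
```

    With the first category included, the components of every encoded vector sum to one, which is the linear dependence the doc text mentions; dropping it removes that dependence.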


