GitHub user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6126#discussion_r30285440
  
    --- Diff: docs/ml-features.md ---
    @@ -183,6 +183,75 @@ for words_label in wordsDataFrame.select("words", "label").take(3):
     </div>
     </div>
     
    +## OneHotEncoder
    +
    +One-hot encoding is a way of formatting categorical features as input into 
machine learning algorithms. It maps a column of label indices to a column of 
binary vectors, with at most a single one-value. The 
[OneHotEncoder](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder) 
class provides this functionality. By default, the resulting binary vector has 
a component for each category, so with 5 categories, an input value of 2.0 
would map to an output vector of (0.0, 0.0, 1.0, 0.0, 0.0). If the includeFirst 
is set to false, the first category is omitted, so the output vector for the 
previous example would be (0.0, 1.0, 0.0, 0.0) and an input value of 0.0 would 
map to a vector of all zeros. Including the first category makes the vector 
columns linearly dependent because they sum up to one.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
    +
    +val df = sqlContext.createDataFrame(Seq(
    +  (0, "a"),
    +  (1, "b"),
    +  (2, "c"),
    +  (3, "a"),
    +  (4, "a"),
    +  (5, "c")
    +)).toDF("id", "category")
    +
    +val indexer = new StringIndexer().setInputCol("category").
    --- End diff --
    
    Will spread these out.  Placing the dot before `setOutputCol`, rather than 
at the end of the line above, means the code can't be pasted into the shell.  
Has this come up before, and did we decide on leading dots in spite of that?
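
    To make the pasting concern concrete, here is a minimal sketch.  The 
`Indexer` class is a hypothetical stand-in for a transformer with chained 
setters, not the real StringIndexer, and the output column name 
"categoryIndex" is only an assumption for illustration.  It is written to be 
pasted into a Scala shell:

```scala
// Hypothetical stand-in with chained setters; NOT the real StringIndexer.
class Indexer(val inputCol: String = "", val outputCol: String = "") {
  def setInputCol(name: String): Indexer = new Indexer(name, outputCol)
  def setOutputCol(name: String): Indexer = new Indexer(inputCol, name)
}

// Trailing dots: every line except the last is visibly incomplete, so an
// interpreter reading line by line keeps reading before it evaluates.
// "categoryIndex" is an assumed column name, used only for illustration.
val indexer = new Indexer().
  setInputCol("category").
  setOutputCol("categoryIndex")

// Leading dots are fine in a compiled source file, but pasted line by line
// the first line below is already a complete statement, so the setter call
// on the following line is no longer attached to it:
//   val indexer = new Indexer()
//     .setInputCol("category")
```

    How a line starting with a dot is handled may depend on the shell version 
and on whether a paste mode is used, so treat the commented-out form as the 
case to avoid rather than a guaranteed error.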
    
    The rationale for including StringIndexer is that I recall finding the 
scikit-learn documentation, which omits this aspect, confusing the first few 
times I looked at it.  Including StringIndexer makes it easier for someone 
who's used to dealing with factors in R to understand what's required to get 
from start to finish.  If you still think it's superfluous, happy to take it 
out.
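
    As a rough picture of the "start to finish" path, the two steps can be 
sketched in plain Scala without Spark.  All names here are hypothetical, and 
the alphabetical tie-break is an assumption made only to keep the sketch 
deterministic; the frequency ordering mirrors how StringIndexer assigns 
index 0 to the most frequent label:

```scala
// Plain-Scala sketch of "index, then one-hot encode"; no Spark involved.
val categories = Seq("a", "b", "c", "a", "a", "c")

// Step 1 (the StringIndexer role): count each label, then assign indices
// with the most frequent label first. Ties are broken alphabetically,
// which is an assumption of this sketch.
val counts: Map[String, Int] =
  categories.groupBy(identity).map { case (label, xs) => label -> xs.size }
val ordered: Seq[String] =
  counts.toSeq.sortBy { case (label, n) => (-n, label) }.map(_._1)
val labelToIndex: Map[String, Int] = ordered.zipWithIndex.toMap

// Step 2 (the OneHotEncoder role): turn an index into a vector that has a
// single 1.0 at that position and 0.0 everywhere else.
def oneHot(index: Int, size: Int): Seq[Double] =
  Seq.tabulate(size)(i => if (i == index) 1.0 else 0.0)

val encoded: Seq[Seq[Double]] =
  categories.map(c => oneHot(labelToIndex(c), ordered.size))
```

    With this data "a" occurs three times, "c" twice and "b" once, so the 
indices are a=0, c=1, b=2, and the row for "b" encodes to (0.0, 0.0, 1.0).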

