[GitHub] spark pull request: SPARK-7579 [ML] [DOC] User guide update for On...

sryza Tue, 19 May 2015 14:52:05 -0700

Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6126#discussion_r30651674
  
    --- Diff: docs/ml-features.md ---
    @@ -183,6 +183,101 @@ for words_label in wordsDataFrame.select("words", "label").take(3):
     </div>
     </div>
     
    +## OneHotEncoder
    +
    +[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features as well. The [OneHotEncoder](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder) class provides this functionality. By default, the resulting binary vector has a component for each category, so with 5 categories, an input value of 2.0 would map to an output vector of (0.0, 0.0, 1.0, 0.0, 0.0). If the `includeFirst` is set to false, the first category is omitted, so the output vector for the previous example would be (0.0, 1.0, 0.0, 0.0) and an input value of 0.0 would map to a vector of all zeros. Including the first category makes the vector columns linearly dependent because they sum up to one.
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +{% highlight scala %}
    +import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
    +
    +val df = sqlContext.createDataFrame(Seq(
    +  (0, "a"),
    +  (1, "b"),
    +  (2, "c"),
    +  (3, "a"),
    +  (4, "a"),
    +  (5, "c")
    +)).toDF("id", "category")
    +
    +val indexer = new StringIndexer()
    +  .setInputCol("category")
    +  .setOutputCol("categoryIndex")
    +  .fit(df)
    +val indexed = indexer.transform(df)
    +
    +val encoder = new OneHotEncoder().setInputCol("categoryIndex").
    +  setOutputCol("categoryVec")
    +val encoded = encoder.transform(indexed)
    +encoded.select("id", "categoryVec").foreach(println)
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="java" markdown="1">
    +{% highlight java %}
    +import com.google.common.collect.Lists;
    +
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.ml.feature.OneHotEncoder;
    +import org.apache.spark.ml.feature.StringIndexer;
    +import org.apache.spark.ml.feature.StringIndexerModel;
    +import org.apache.spark.sql.DataFrame;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.RowFactory;
    +import org.apache.spark.sql.types.DataTypes;
    +import org.apache.spark.sql.types.Metadata;
    +import org.apache.spark.sql.types.StructField;
    +import org.apache.spark.sql.types.StructType;
    +
    +JavaRDD<Row> jrdd = jsc.parallelize(Lists.newArrayList(
    +    RowFactory.create(0, "a"),
    +    RowFactory.create(1, "b"),
    +    RowFactory.create(2, "c"),
    +    RowFactory.create(3, "a"),
    +    RowFactory.create(4, "a"),
    +    RowFactory.create(5, "c")
    +));
    +StructType schema = new StructType(new StructField[]{
    +    new StructField("id", DataTypes.DoubleType, false, Metadata.empty()),
    +    new StructField("category", DataTypes.StringType, false, Metadata.empty())
    +});
    +DataFrame df = sqlContext.createDataFrame(jrdd, schema);
    +StringIndexerModel indexer = new StringIndexer()
    +  .setInputCol("category")
    +  .setOutputCol("categoryIndex")
    +  .fit(df);
    +DataFrame indexed = indexer.transform(df);
    +
    +OneHotEncoder encoder = new OneHotEncoder()
    +  .setInputCol("categoryIndex")
    +  .setOutputCol("categoryVec");
    +DataFrame encoded = encoder.transform(indexed);
    +{% endhighlight %}
    +</div>
    +
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +from pyspark.ml.feature import OneHotEncoder, StringIndexer
    +
    +df = sqlContext.createDataFrame([
    +  (0, "a"),
    +  (1, "b"),
    +  (2, "c"),
    +  (3, "a"),
    +  (4, "a"),
    +  (5, "c")
    +], ["id", "category"])
    +
    +stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
    +model = stringIndexer.fit(df)
    +indexed = model.transform(df)
    +encoder = OneHotEncoder(includeFirst=False, inputCol="categoryIndex", outputCol="categoryVec")
    +encoded = encoder.transform(indexed)
    +</div>
    --- End diff --
    
    Posting a patch that fixes this. My Jekyll build attempts have been thwarted by errors like:
    
    `/home/sandy/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:318: polymorphic expression cannot be instantiated to expected type;`
    
    Any idea how to get past these?
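
As an aside on the quoted diff: the mapping that the OneHotEncoder paragraph describes can be sketched in plain Python, without Spark. The helper `one_hot` and its `include_first` parameter are hypothetical names chosen to mirror the `includeFirst` option in the diff; this only illustrates the documented behavior and is not the Spark implementation.

```python
def one_hot(index, num_categories, include_first=True):
    """Return a one-hot vector (list of floats) for a category index.

    Hypothetical helper mirroring the behavior described in the quoted
    docs paragraph; it is not part of Spark.
    """
    i = int(index)
    if not 0 <= i < num_categories:
        raise ValueError("index %r out of range for %d categories"
                         % (index, num_categories))
    if include_first:
        # One component per category.
        vec = [0.0] * num_categories
        vec[i] = 1.0
        return vec
    # First category omitted: index 0 maps to all zeros,
    # index k > 0 maps to position k - 1.
    vec = [0.0] * (num_categories - 1)
    if i > 0:
        vec[i - 1] = 1.0
    return vec


print(one_hot(2.0, 5))                       # full-width vector
print(one_hot(2.0, 5, include_first=False))  # first category dropped
print(one_hot(0.0, 5, include_first=False))  # all zeros
```

With the first category included, the components of every vector sum to one, which is the linear dependence the paragraph mentions; dropping it removes that redundancy.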



