[jira] [Comment Edited] (SPARK-5888) Add OneHotEncoder as a Transformer

Herman van Hovell tot Westerflier (JIRA) Sun, 10 May 2015 06:40:49 -0700

    [ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537168#comment-14537168 ]


Herman van Hovell tot Westerflier edited comment on SPARK-5888 at 5/10/15 1:39 PM:
-----------------------------------------------------------------------------------

Hi,

When I try to use OneHotEncoder in combination with StringIndexer:
{code:none}
val in = "col"
val out = "col_out"
val indexedNominalCol = in + "Idx"
val indexer = new StringIndexer().
       setInputCol(in).
       setOutputCol(indexedNominalCol)
val encoder = new OneHotEncoder().
       setInputCol(indexedNominalCol).
       setOutputCol(out)
val pipeline = new Pipeline().setStages(Array(indexer, encoder))
val model = pipeline.fit(...)
{code}
It gives me the following error:
{noformat}
java.util.NoSuchElementException: None.get
        at scala.None$.get(Option.scala:313)
        at scala.None$.get(Option.scala:311)
        at org.apache.spark.ml.feature.OneHotEncoder$$anonfun$transformSchema$3.apply(OneHotEncoder.scala:72)
        at org.apache.spark.ml.feature.OneHotEncoder$$anonfun$transformSchema$3.apply(OneHotEncoder.scala:72)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.ml.feature.OneHotEncoder.transformSchema(OneHotEncoder.scala:72)
        at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
        at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
        at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:164)
        at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:58)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:118)
{noformat}
I think this happens because OneHotEncoder assumes that the relevant domain
information is available when transformSchema(...) is called, whereas
StringIndexer is much lazier and only has this information after the fitting
process has completed. Combining StringIndexer with OneHotEncoder is a typical
use case. Is it possible to fix this? I am willing to take a stab at it if
needed.
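
The timing problem described above can be sketched without Spark. This is a toy
model only: the object, the Schema type, and every function name below are
hypothetical and are not part of the Spark API. It assumes a schema pass that
runs before any data is seen, and shows how an encoder that calls Option.get
during that pass fails with None.get, while one that defers the lookup does not.

```scala
// Toy model, not Spark code: every name here is hypothetical.
object SchemaTimingSketch {
  // A column's schema entry: its category labels, if they are known yet.
  type Schema = Map[String, Option[Seq[String]]]

  // Mimics an indexer during the schema pass: it can only declare the
  // output column, because the labels are not known before fitting.
  def indexerSchema(schema: Schema, out: String): Schema =
    schema + (out -> None)

  // Mimics the indexer after fitting: the labels come from the data.
  def indexerFit(schema: Schema, out: String, data: Seq[String]): Schema =
    schema + (out -> Some(data.distinct))

  // Mimics an encoder that insists on labels during the schema pass.
  def strictEncoderSchema(schema: Schema, in: String, out: String): Schema = {
    val labels = schema(in).get // throws NoSuchElementException on None
    schema + (out -> Some(labels))
  }

  // A lenient variant: pass the labels through only if they are known.
  def lenientEncoderSchema(schema: Schema, in: String, out: String): Schema =
    schema + (out -> schema(in))
}
```

In this model the strict encoder succeeds only on the schema produced by
indexerFit, which matches the observation that the information exists only
after fitting.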

Kind regards,
Herman



was (Author: hvanhovell):
Hi,

When I try to use OneHotEncoder in combination with StringIndexer:
{code:none}
val indexedNominalCol = in + "Idx"
val indexer = new StringIndexer().
       setInputCol(in).
       setOutputCol(indexedNominalCol)
val encoder = new OneHotEncoder().
       setInputCol(indexedNominalCol).
       setOutputCol(out)
indexer :: encoder :: Nil
{code}
It gives me the following error:
{noformat}
java.util.NoSuchElementException: None.get
        at scala.None$.get(Option.scala:313)
        at scala.None$.get(Option.scala:311)
        at org.apache.spark.ml.feature.OneHotEncoder$$anonfun$transformSchema$3.apply(OneHotEncoder.scala:72)
        at org.apache.spark.ml.feature.OneHotEncoder$$anonfun$transformSchema$3.apply(OneHotEncoder.scala:72)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.ml.feature.OneHotEncoder.transformSchema(OneHotEncoder.scala:72)
        at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
        at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:164)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
        at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:164)
        at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:58)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:118)
{noformat}
I think this happens because OneHotEncoder assumes that the relevant domain
information is available when transformSchema(...) is called, whereas
StringIndexer is much lazier and only has this information after the fitting
process has completed. Combining StringIndexer with OneHotEncoder is a typical
use case. Is it possible to fix this? I am willing to take a stab at it if
needed.

Kind regards,
Herman


> Add OneHotEncoder as a Transformer
> ----------------------------------
>
>                 Key: SPARK-5888
>                 URL: https://issues.apache.org/jira/browse/SPARK-5888
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Sandy Ryza
>             Fix For: 1.4.0
>
>
> `OneHotEncoder` takes a categorical column and outputs a vector column, which
> stores the category info as binary values.
> {code}
> val ohe = new OneHotEncoder()
>   .setInputCol("countryIndex")
>   .setOutputCol("countries")
> {code}
> It should read the category info from the metadata and assign feature names 
> properly in the output column. We need to discuss the default naming scheme 
> and whether we should let it process multiple categorical columns at the same 
> time.
> One category (the most frequent one) should be removed from the output to
> make the output columns linearly independent. Or this could be an option
> turned on by default.
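
The effect of removing one category can be illustrated in plain Scala. This is
a minimal sketch, not the proposed OneHotEncoder: the OneHotSketch object and
its encode function are hypothetical, and the caller is assumed to pass in the
index of the category to drop. Without a dropped category the output columns
always sum to 1, which makes them linearly dependent on a constant intercept
column; dropping one category removes that dependency, and the dropped
category is represented by the all-zeros vector.

```scala
// Plain-Scala illustration; names are hypothetical, not Spark API.
object OneHotSketch {
  // Encode a category index as a 0/1 vector over numCategories categories.
  // If dropped is Some(d), category d gets no slot of its own and is
  // encoded as all zeros, so the vector has numCategories - 1 entries.
  def encode(index: Int, numCategories: Int, dropped: Option[Int]): Array[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    dropped match {
      case None =>
        Array.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)
      case Some(d) =>
        // Slots correspond to the remaining categories, in their original order.
        val kept = (0 until numCategories).filter(_ != d)
        kept.map(i => if (i == index) 1.0 else 0.0).toArray
    }
  }
}
```

For example, with three categories and category 0 dropped, index 0 maps to a
two-element vector of zeros, while indices 1 and 2 each set one slot.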



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)