[ https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737960#comment-14737960 ]
Eric Liang commented on SPARK-10523:
------------------------------------
We can convert to boolean easily enough, but supporting >2 levels will require
SPARK-7159.
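A minimal sketch of that two-level conversion, reusing the `ddf`, `class` and `i`
names from the report below (illustrative only, not a committed API change):
{code}
# Cast the two-level string label to a boolean column, which SparkR's glm
# already accepts as a 0/1 label (sketch; names come from the report below):
ddf <- withColumn(ddf, "label", ddf$class == "a")
glm(label ~ i, family = "binomial", data = ddf)
{code}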
> SparkR formula syntax to turn strings/factors into numerics
> -----------------------------------------------------------
>
> Key: SPARK-10523
> URL: https://issues.apache.org/jira/browse/SPARK-10523
> Project: Spark
> Issue Type: Improvement
> Components: ML, SparkR
> Reporter: Vincent Warmerdam
>
> In plain (non-SparkR) R, the formula syntax allows strings or factors to be
> turned into dummy variables automatically when calling a classifier. As a
> result, the following R pattern is legal and commonly used:
> {code}
> library(magrittr)
> df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
> glm(class ~ i, family = "binomial", data = df)
> {code}
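> What glm does with the two-level factor response here amounts to roughly the
> recoding below (a hedged illustration only, not the actual glm internals):
> {code}
> # glm() treats the first factor level as 0 and the second as 1,
> # so the label never has to be recoded by hand:
> as.numeric(factor(df$class)) - 1
> # [1] 0 0 1 1
> {code}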
> The glm method knows that `class` is a string/factor and handles it
> appropriately by casting it to a 0/1 vector before fitting the model. SparkR
> doesn't do this:
> {code}
> > ddf <- sqlContext %>% createDataFrame(df)
> > glm(class ~ i, family = "binomial", data = ddf)
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.IllegalArgumentException: Unsupported type for label: StringType
>     at org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
>     at org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
>     at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
>     at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
>     at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
>     at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
>     at org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
>     at org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.refl
> {code}
> This can be worked around with a bit of manual labor, since SparkR does accept
> booleans as if they were integers here:
> {code}
> > ddf <- ddf %>% withColumn("to_pred", .$class == "a")
> > glm(to_pred ~ i, family = "binomial", data = ddf)
> {code}
> But this quickly becomes tedious, especially for models that involve more than
> two classes. This is perhaps less relevant for logistic regression (which is
> essentially a two-class approach), but it certainly matters if you want to use
> a formula for a random forest and a column denotes, say, the flower species
> from the iris dataset.
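> As a rough sketch of that tedium (the `iris_ddf` DataFrame and its `Species`
> column are assumed here purely for illustration), every class would need its
> own hand-made boolean column before a formula could use it:
> {code}
> # one manual withColumn call per class; names are illustrative only
> iris_ddf <- iris_ddf %>%
>   withColumn("is_setosa", .$Species == "setosa") %>%
>   withColumn("is_versicolor", .$Species == "versicolor") %>%
>   withColumn("is_virginica", .$Species == "virginica")
> {code}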
> Is there a good reason why this should not be a feature of formulas in Spark?
> I am aware of SPARK-8774, which looks like it is addressing a similar theme
> but a different issue.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)