Github user yanboliang commented on the issue:
https://github.com/apache/spark/pull/15788
@jkbradley Thanks for your comments. I fully understand your concerns and
have considered them carefully. I totally agree that we should make
```spark.glm``` match R's ```glm``` as much as possible. However, I found that
the output prediction of ```spark.glm``` for the binomial family is meaningless
to users if we only provide the probability value.
Let's look at how R ```glm``` works for the binomial family (with a string
label):
```R
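data <- iris[iris$Species %in% c("versicolor", "virginica"), ]
# Drop the unused "setosa" level so the binomial response has exactly two levels.
data$Species <- droplevels(data$Species)
model <- glm(Species ~ ., data = data, family = binomial(link = "logit"))
# Predicted probabilities of the second level ("virginica").
predict(model, data, type = "response")
f <- factor(data$Species)
# Round the probabilities to 0/1, then map them back to the string labels.
prediction <- round(predict(model, data, type = "response"))
factor(prediction, labels = levels(f))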
```
The label is a string, and R ```glm``` encodes it as a factor automatically
when training. The default prediction output is on the scale of the linear
predictor (log-odds), but we can convert it into a probability by specifying
```type = "response"```. Rounding that probability then gives the prediction
value in the form ```0``` or ```1```.
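To make the difference between the two prediction scales concrete, here is a small sketch. The ```droplevels``` call and the variable names ```link``` and ```prob``` are my own additions for illustration:

```R
data <- iris[iris$Species %in% c("versicolor", "virginica"), ]
data$Species <- droplevels(data$Species)
model <- glm(Species ~ ., data = data, family = binomial(link = "logit"))

# Default (type = "link"): predictions on the linear-predictor (log-odds) scale.
link <- predict(model, data)
# type = "response": predicted probabilities of the second factor level.
prob <- predict(model, data, type = "response")

# The two scales are related by the inverse logit function.
all.equal(unname(plogis(link)), unname(prob))
# Rounding the probabilities yields the 0/1 predictions.
table(round(prob))
```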
In native R, however, the ```factor``` function exposes the mapping between
the string values and the converted numeric values. For example,
```versicolor``` maps to ```0``` and ```virginica``` maps to ```1```. Users can
therefore convert the numeric prediction value into the string value and vice
versa.
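As a rough illustration of that round trip in base R (the vector ```species``` and the variable names are made up for this example):

```R
species <- c("versicolor", "virginica", "virginica", "versicolor")
f <- factor(species)

# String -> numeric: factor codes are 1-based, so subtract 1 to get 0/1 labels.
numeric_label <- as.integer(f) - 1

# Numeric -> string: index back into the factor levels.
string_label <- levels(f)[numeric_label + 1]

# The round trip recovers the original strings.
identical(string_label, species)
```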
In Spark, we use ```StringIndexer``` to encode the label and
```IndexToString``` to convert it back. This is wrapped inside ```RFormula```,
which SparkR users cannot access, so they cannot get the mapping between the
string label and the numeric label. I think getting the original label as the
prediction value is one of the most important use cases for SparkR users, so I
proposed this change.
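To show what kind of mapping is hidden from SparkR users, here is a base R sketch that only mimics the idea. It does not call Spark: the three helper functions are hypothetical, and it assumes the default ```StringIndexer``` ordering (most frequent label gets index ```0```), which is my understanding of the default. The example avoids ties in label frequency.

```R
# Hypothetical helper: build the label map, most frequent label first.
fit_label_map <- function(labels) {
  counts <- table(labels)
  names(sort(counts, decreasing = TRUE))
}

# Hypothetical helper: string label -> 0-based numeric index.
label_to_index <- function(labels, label_map) {
  match(labels, label_map) - 1
}

# Hypothetical helper: 0-based numeric index -> string label.
index_to_label <- function(indices, label_map) {
  label_map[indices + 1]
}

labels <- c("virginica", "versicolor", "virginica")
label_map <- fit_label_map(labels)
idx <- label_to_index(labels, label_map)
index_to_label(idx, label_map)
```

Without access to something like ```label_map```, a user who only sees ```idx``` cannot tell which string each number stands for.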
Another option is to expose ```spark.factor``` as a function for SparkR
users, so that they can use it like ```factor``` in native R. But the
internals of ```RFormula``` (which are closely related to the implementation of
```spark.factor```) are private to SparkR users, and we are not ready to make
them public (we are still making improvements to ```RFormula```, see
[SPARK-15540](https://issues.apache.org/jira/browse/SPARK-15540)), so I gave up
on this option.