[GitHub] spark issue #15788: [SPARK-18291][SparkR][ML] SparkR glm predict should outp...

yanboliang Mon, 21 Nov 2016 23:45:54 -0800

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/15788
  
    @jkbradley Thanks for your comments. I fully understand your concerns and 
have considered them carefully. I totally agree that we should make ```spark.glm``` match 
R's ```glm``` as much as possible. However, I found that the prediction output of 
```spark.glm``` for the binomial family is not meaningful if we only provide the 
probability value.
    
    Let's look at how R ```glm``` works for the binomial family (with a string 
label):
    ```R
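    # Keep only two classes so the response is binary.
    data <- iris[iris$Species %in% c("versicolor", "virginica"), ]
    model <- glm(Species ~ ., data = data, family = binomial(link = "logit"))
    # type = "response" returns probabilities instead of the default link scale.
    predict(model, data, type = "response")
    # Round to 0/1, then map back to the original string labels.
    f <- factor(data$Species)
    prediction <- round(predict(model, data, type = "response"))
    factor(prediction, labels = levels(f))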
    ```
    The label is a string, and R ```glm``` automatically encodes it as a factor 
during training. By default, predictions are returned on the link (log-odds) 
scale, but we can convert them into probabilities by specifying ```type = 
"response"```; rounding then gives prediction values of the form ```0``` and 
```1```.
    In native R, however, the ```factor``` function gives the mapping between the 
string values and the converted numeric values. For example, ```versicolor``` maps to 
```0``` and ```virginica``` maps to ```1```. Users can therefore convert the numeric 
prediction value into the string value and vice versa.
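
    To make that mapping concrete, here is a minimal base-R sketch. The variable 
names ```label_map```, ```numeric_label``` and ```string_label``` are only illustrative, 
and the 0/1 encoding is assumed to follow the order of the factor levels, as described above.

    ```R
    # Two-class subset of iris, as in the example above.
    data <- iris[iris$Species %in% c("versicolor", "virginica"), ]
    f <- factor(data$Species)  # re-creating the factor drops the unused "setosa" level

    # The level order defines the encoding: first level -> 0, second level -> 1.
    label_map <- setNames(seq_along(levels(f)) - 1, levels(f))
    print(label_map)

    # String label -> numeric label.
    numeric_label <- as.integer(f) - 1

    # Numeric label -> string label.
    string_label <- levels(f)[numeric_label + 1]
    ```

    Because the levels are kept on the factor itself, the conversion works in 
both directions without any extra bookkeeping.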
    
    In Spark, we use ```StringIndexer``` to encode the label and 
```IndexToString``` to convert it back. Both are wrapped inside ```RFormula```, and 
SparkR users cannot access them, which means they cannot get the mapping between the 
string label and the numeric label. I think getting the original label as the prediction 
value is one of the most important use cases for SparkR users, so I proposed 
this change.
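
    As a rough illustration of that encode/decode round trip, here is a base-R 
sketch. It is not the SparkR or Spark ML API: ```string_index``` and 
```index_to_string``` are hypothetical helpers, and ordering labels by descending 
frequency (ties broken alphabetically here) is an assumption modelled on the default 
behaviour of ```StringIndexer```.

    ```R
    # Hypothetical helper: assign index 0 to the most frequent label, 1 to the next, etc.
    string_index <- function(labels) {
      counts <- table(labels)
      # Descending count; alphabetical order breaks ties (an assumption of this sketch).
      ord <- order(-as.integer(counts), names(counts))
      lv <- names(counts)[ord]
      list(labels = lv, index = match(labels, lv) - 1)
    }

    # Hypothetical helper: map numeric indices back to the original string labels.
    index_to_string <- function(index, labels) labels[index + 1]

    y <- c("virginica", "versicolor", "virginica", "virginica", "versicolor")
    enc <- string_index(y)
    decoded <- index_to_string(enc$index, enc$labels)
    stopifnot(identical(decoded, y))
    ```

    The point of the sketch is that decoding needs the stored label list; without 
access to it, a numeric prediction cannot be turned back into a string label.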
    Another option is to expose a ```spark.factor``` function to SparkR 
users, so that they could use it like ```factor``` in native R. But the 
internals of ```RFormula``` (which are closely related to the 
implementation of ```spark.factor```) are private to SparkR users, and we are 
not ready to make them public (we are still making improvements to ```RFormula```, 
see [SPARK-15540](https://issues.apache.org/jira/browse/SPARK-15540)), so I 
gave up on this option.


