GitHub user viirya commented on the issue:
https://github.com/apache/spark/pull/20146
Hmm, I reconsidered
https://github.com/apache/spark/pull/20146#pullrequestreview-87070102. Even if
we use a dataset without duplicate values, if the string index order used by R
glm differs from the one used by RFormula, we still can't get the same results,
because it looks like R glm doesn't follow frequency or alphabetical order.
For example, I've tried the dataset Puromycin:
```R
> training <- suppressWarnings(createDataFrame(Puromycin))
> stats <- summary(spark.glm(training, conc ~ rate + state))
> rStats <- summary(glm(conc ~ rate + state, data = Puromycin))
> rStats$coefficients
                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)    -0.595461828 0.157462177 -3.781618 1.171709e-03
rate            0.006642461 0.001022196  6.498228 2.464757e-06
stateuntreated  0.136323828 0.095090605  1.433620 1.671302e-01
> stats$coefficients
                  Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)   -0.459138000 0.130420375 -3.520447 2.150817e-03
rate           0.006642461 0.001022196  6.498228 2.464757e-06
state_treated -0.136323828 0.095090605 -1.433620 1.671302e-01
```
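You can see that, because the string index of the state column still differs
between R glm and RFormula, we can't get the same results: the rate row
matches, but the intercept differs and the state coefficient has the opposite
sign.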
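To check whether the baseline category alone accounts for the gap, here is a minimal sketch in plain R (no Spark needed). It assumes RFormula effectively treated `untreated` as the reference level, which is only inferred from the `state_treated` row above and not confirmed from the RFormula code. The names `flipped`, `fit_default` and `fit_flipped` are made up for illustration.

```R
# Puromycin ships with base R (datasets package); state is a two-level factor.
fit_default <- glm(conc ~ rate + state, data = Puromycin)

# Assumption: make "untreated" the reference level, mimicking the encoding
# that the state_treated row in the Spark output suggests.
flipped <- transform(Puromycin, state = relevel(state, ref = "untreated"))
fit_flipped <- glm(conc ~ rate + state, data = flipped)

# Compare the two sets of coefficients side by side.
print(coef(fit_default))
print(coef(fit_flipped))
```

If the two fits agree on rate and the state coefficients differ only in sign, they are the same model under different encodings, which would suggest the mismatch comes from indexing rather than from the solver.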
A workaround is to use a dataset that doesn't need string indexing. What do
you think? @felixcheung