Github user viirya commented on the issue:
https://github.com/apache/spark/pull/20146
Another workaround is to add some rows to the iris dataset so that the
three values in the Species column no longer have equal frequencies.
For example, we add three more rows to iris. Now the frequency of
`versicolor` is 52, `virginica` is 51, and `setosa` is 50. This makes
`RFormula` index them as "setosa"->2, "versicolor"->0, "virginica"->1, which
matches the encoding of R glm.
```R
> iris
...
151 5.9 3.0 5.1 1.8 virginica
3
152 5.7 2.8 4.1 1.3 versicolor
2
153 5.7 2.8 4.1 1.3 versicolor
2
> training <- suppressWarnings(createDataFrame(iris))
> stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species))
> rStats <- summary(glm(Sepal.Width ~ Sepal.Length + Species, data = iris))
> stats$coefficients
                     Estimate Std. Error    t value     Pr(>|t|)
(Intercept)         1.7057547 0.23221834   7.345478 1.246270e-11
Sepal_Length        0.3440362 0.04567319   7.532564 4.439116e-12
Species_versicolor -0.9736771 0.07073874 -13.764410 0.000000e+00
Species_virginica  -0.9931144 0.09164069 -10.837046 0.000000e+00
> rStats$coefficients
                    Estimate Std. Error    t value     Pr(>|t|)
(Intercept)        1.7057547 0.23221834   7.345478 1.246279e-11
Sepal.Length       0.3440362 0.04567319   7.532564 4.439021e-12
Speciesversicolor -0.9736771 0.07073874 -13.764410 2.469196e-28
Speciesvirginica  -0.9931144 0.09164069 -10.837046 1.527430e-20
> coefs <- stats$coefficients
> rCoefs <- rStats$coefficients
> all(abs(rCoefs - coefs) < 1e-4)
[1] TRUE
```
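For reference, here is a rough sketch in plain R (no Spark session needed) of
how the three extra rows could be appended and how the resulting frequencies
order the labels. The row values are the ones shown in the transcript above.
The names `iris2`, `extra` and `idx` are only illustrative, and the index
computation just imitates a frequency-descending ordering; it is not the actual
`RFormula` code path.

```R
# Work on a copy so the built-in `iris` stays untouched.
iris2 <- iris

# Three extra rows (hypothetical helper frame), values as in the transcript.
extra <- data.frame(
  Sepal.Length = c(5.9, 5.7, 5.7),
  Sepal.Width  = c(3.0, 2.8, 2.8),
  Petal.Length = c(5.1, 4.1, 4.1),
  Petal.Width  = c(1.8, 1.3, 1.3),
  Species      = factor(c("virginica", "versicolor", "versicolor"),
                        levels = levels(iris$Species))
)
iris2 <- rbind(iris2, extra)

# Label frequencies after adding the rows.
freq <- table(iris2$Species)
print(freq)

# Assumption: labels are indexed from 0 in order of descending frequency.
ordered_labels <- names(sort(freq, decreasing = TRUE))
idx <- setNames(seq_along(ordered_labels) - 1, ordered_labels)
print(idx)
```

With unequal frequencies there are no ties, so the ordering no longer depends
on how ties happen to be broken.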
@felixcheung @WeichenXu123 What do you think?