Github user viirya commented on the issue:
https://github.com/apache/spark/pull/20146
It seems to me that we can't set the string indexer order for R glm.
A workaround is to encode Species manually first, then have both R glm and
spark.glm fit the encoded Species column instead of the original Species.
I checked the coefficients produced this way: those from R glm and spark.glm
are the same. However, because Species is converted to numeric, the
coefficients no longer include Species_versicolor and Species_virginica.
```R
> encodedSpecies <- factor(iris$Species, levels = c("setosa", "versicolor", "virginica"))
> iris$encodedSpecies <- as.numeric(encodedSpecies)
> training <- suppressWarnings(createDataFrame(iris))
> stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + encodedSpecies))
> rStats <- summary(glm(Sepal.Width ~ Sepal.Length + encodedSpecies, data = iris))
> stats$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
(Intercept)     2.2595167  0.2604476  8.675514 7.105427e-15
Sepal_Length    0.2937617  0.0582277  5.045051 1.318724e-06
encodedSpecies -0.4593655  0.0588556 -7.804958 1.023848e-12
> rStats$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
(Intercept)     2.2595167  0.2604476  8.675514 7.079930e-15
Sepal.Length    0.2937617  0.0582277  5.045051 1.318724e-06
encodedSpecies -0.4593655  0.0588556 -7.804958 1.023755e-12
> coefs <- stats$coefficients
> rCoefs <- rStats$coefficients
> all(abs(rCoefs - coefs) < 1e-4)
[1] TRUE
```
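To show why the explicit `levels` order matters in this workaround, here is a minimal sketch in base R only (no Spark session). The names `lvls`, `encoded`, `reversed`, `fit_fwd` and `fit_rev` are made up for illustration and are not part of the PR. It assumes the built-in `iris` dataset with its original column names.

```R
# Illustrative sketch: base R only, variable names are hypothetical.
data(iris)
lvls <- c("setosa", "versicolor", "virginica")

# as.numeric() on a factor returns each value's position in `levels`.
encoded <- as.numeric(factor(iris$Species, levels = lvls))
mapping <- tapply(encoded, iris$Species, unique)
print(mapping)

# Reversing the level order gives reversed = 4 - encoded, so the fitted
# slope for the encoded column flips sign (and the intercept shifts).
reversed <- as.numeric(factor(iris$Species, levels = rev(lvls)))
fit_fwd <- glm(Sepal.Width ~ Sepal.Length + encoded, data = iris)
fit_rev <- glm(Sepal.Width ~ Sepal.Length + reversed, data = iris)
print(coef(fit_fwd)[["encoded"]])
print(coef(fit_rev)[["reversed"]])
```

Because the numeric codes depend entirely on the chosen level order, both R glm and spark.glm need to see the same pre-encoded column for their coefficients to be comparable.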