Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20146
  
    It seems to me that we can't set the string indexer order for R glm.
    
    A workaround is to encode the Species column manually first, then have both R glm and spark.glm fit the encoded Species column instead of the original Species.
    
    I checked the coefficients produced this way. The coefficients from R glm and spark.glm are the same. However, because Species is converted to a numeric column, the coefficients no longer include Species_versicolor and Species_virginica.
    
    ```R
    > encodedSpecies <- factor(iris$Species, levels=c("setosa", "versicolor", "virginica"))
    > iris$encodedSpecies <- as.numeric(encodedSpecies)
    > training <- suppressWarnings(createDataFrame(iris))
    > stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + encodedSpecies))
    > rStats <- summary(glm(Sepal.Width ~ Sepal.Length + encodedSpecies, data = iris))
    > stats$coefficients
                     Estimate Std. Error   t value     Pr(>|t|)
    (Intercept)     2.2595167  0.2604476  8.675514 7.105427e-15
    Sepal_Length    0.2937617  0.0582277  5.045051 1.318724e-06
    encodedSpecies -0.4593655  0.0588556 -7.804958 1.023848e-12
    > rStats$coefficients
                     Estimate Std. Error   t value     Pr(>|t|)
    (Intercept)     2.2595167  0.2604476  8.675514 7.079930e-15
    Sepal.Length    0.2937617  0.0582277  5.045051 1.318724e-06
    encodedSpecies -0.4593655  0.0588556 -7.804958 1.023755e-12
    > coefs <- stats$coefficients
    > rCoefs <- rStats$coefficients
    > all(abs(rCoefs - coefs) < 1e-4)
    [1] TRUE
    ```
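
    To illustrate why the per-level coefficients disappear, here is a minimal base-R sketch (no Spark session needed). It assumes the built-in iris dataset; the names `df`, `fitNumeric` and `fitFactor` are only for illustration. Note that the numeric encoding fits a single slope across the three species, so it is a different model from the dummy-coded one, not just a relabeling.

    ```R
    # Work on a copy of the built-in iris data.
    df <- iris

    # Encode Species as 1, 2, 3 using an explicit level order.
    df$encodedSpecies <- as.numeric(
      factor(df$Species, levels = c("setosa", "versicolor", "virginica"))
    )

    # Numeric predictor: one slope term for encodedSpecies.
    fitNumeric <- glm(Sepal.Width ~ Sepal.Length + encodedSpecies, data = df)

    # Factor predictor: R dummy-codes Species, one term per non-reference level.
    fitFactor <- glm(Sepal.Width ~ Sepal.Length + Species, data = df)

    print(names(coef(fitNumeric)))
    # "(Intercept)" "Sepal.Length" "encodedSpecies"
    print(names(coef(fitFactor)))
    # "(Intercept)" "Sepal.Length" "Speciesversicolor" "Speciesvirginica"
    ```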

