GitHub user viirya commented on the issue:

    https://github.com/apache/spark/pull/20146
  
    Another workaround: we can add a few rows to the iris dataset so that the 
three values in the Species column no longer have equal frequencies.
    
    For example, we add three more rows to iris. The frequencies are now 52 for 
`versicolor`, 51 for `virginica`, and 50 for `setosa`. This makes `RFormula` 
index them as "setosa"->2, "versicolor"->0, "virginica"->1, which matches the 
encoding used by R's `glm`.
    
    ```R
    > iris
    ...
    151          5.9         3.0          5.1         1.8  virginica              3
    152          5.7         2.8          4.1         1.3 versicolor              2
    153          5.7         2.8          4.1         1.3 versicolor              2
    > training <- suppressWarnings(createDataFrame(iris))
    > stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species))
    > rStats <- summary(glm(Sepal.Width ~ Sepal.Length + Species, data = iris))
    > stats$coefficients
                         Estimate Std. Error    t value     Pr(>|t|)
    (Intercept)         1.7057547 0.23221834   7.345478 1.246270e-11
    Sepal_Length        0.3440362 0.04567319   7.532564 4.439116e-12
    Species_versicolor -0.9736771 0.07073874 -13.764410 0.000000e+00
    Species_virginica  -0.9931144 0.09164069 -10.837046 0.000000e+00
    > rStats$coefficients
                        Estimate Std. Error    t value     Pr(>|t|)
    (Intercept)        1.7057547 0.23221834   7.345478 1.246279e-11
    Sepal.Length       0.3440362 0.04567319   7.532564 4.439021e-12
    Speciesversicolor -0.9736771 0.07073874 -13.764410 2.469196e-28
    Speciesvirginica  -0.9931144 0.09164069 -10.837046 1.527430e-20
    > coefs <- stats$coefficients
    > rCoefs <- rStats$coefficients
    > all(abs(rCoefs - coefs) < 1e-4)
    [1] TRUE
    ```
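
    A rough sketch of how the extra rows could be appended in plain R. This 
rests on an assumption: the three added rows shown above look like copies of 
rows 150 and 100 of the stock iris data. `irisExt` is only an illustrative 
name, and the extra unnamed column in the printout is not reproduced here.

    ```R
    # Assumption: the new rows are copies of row 150 (virginica, once)
    # and row 100 (versicolor, twice) of the built-in iris data frame.
    irisExt <- rbind(iris, iris[c(150, 100, 100), ])
    rownames(irisExt) <- NULL  # renumber the rows 1..153

    # Species frequencies are no longer tied after the append.
    freq <- table(irisExt$Species)
    print(freq)
    ```

    With the frequencies no longer tied, a frequency-based ordering of the 
Species values is unambiguous.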
    
    @felixcheung @WeichenXu123 What do you think?


