[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...

viirya Fri, 05 Jan 2018 16:19:15 -0800

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20146
  
    Another workaround is to add a few rows to the iris dataset so that the 
three values in the Species column no longer have equal frequencies.
    
    For example, we add three more rows to iris. The frequencies are now 
`versicolor`: 52, `virginica`: 51, `setosa`: 50. This makes `RFormula` 
index them as "setosa"->2, "versicolor"->0, "virginica"->1, which matches the 
encoding used by R's glm.
    
    ```R
    > iris
    ...
    151          5.9         3.0          5.1         1.8  virginica              3
    152          5.7         2.8          4.1         1.3 versicolor              2
    153          5.7         2.8          4.1         1.3 versicolor              2
    > training <- suppressWarnings(createDataFrame(iris))
    > stats <- summary(spark.glm(training, Sepal_Width ~ Sepal_Length + Species))
    > rStats <- summary(glm(Sepal.Width ~ Sepal.Length + Species, data = iris))
    > stats$coefficients
                         Estimate Std. Error    t value     Pr(>|t|)
    (Intercept)         1.7057547 0.23221834   7.345478 1.246270e-11
    Sepal_Length        0.3440362 0.04567319   7.532564 4.439116e-12
    Species_versicolor -0.9736771 0.07073874 -13.764410 0.000000e+00
    Species_virginica  -0.9931144 0.09164069 -10.837046 0.000000e+00
    > rStats$coefficients
                        Estimate Std. Error    t value     Pr(>|t|)
    (Intercept)        1.7057547 0.23221834   7.345478 1.246279e-11
    Sepal.Length       0.3440362 0.04567319   7.532564 4.439021e-12
    Speciesversicolor -0.9736771 0.07073874 -13.764410 2.469196e-28
    Speciesvirginica  -0.9931144 0.09164069 -10.837046 1.527430e-20
    > coefs <- stats$coefficients
    > rCoefs <- rStats$coefficients
    > all(abs(rCoefs - coefs) < 1e-4)
    [1] TRUE
    ```
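
To make the indexing argument concrete, here is a rough sketch in plain Python of frequency-descending label indexing. It is not Spark's implementation: the helper name `frequency_desc_index` and the alphabetical tie-break are assumptions made only for illustration. With the counts from the modified iris (50 / 52 / 51), it gives the mapping described above. Presumably `setosa`, which gets the last index, is then the level dropped as the reference, which is consistent with the coefficient names in the output above.

```python
from collections import Counter


def frequency_desc_index(labels):
    """Hypothetical helper: map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically (an assumption,
    # irrelevant here because the three counts are all different).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


# Species counts after appending the three extra rows to iris.
species = ["setosa"] * 50 + ["versicolor"] * 52 + ["virginica"] * 51
index = frequency_desc_index(species)
print(index)  # {'versicolor': 0, 'virginica': 1, 'setosa': 2}
```

With the original 50 / 50 / 50 counts, the order would depend entirely on how ties are broken, which is why the unequal frequencies matter.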
    
    @felixcheung @WeichenXu123 What do you think?



