[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...

viirya Sat, 06 Jan 2018 07:03:06 -0800

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/20146
  
    Hmm, I've reconsidered this: 
https://github.com/apache/spark/pull/20146#pullrequestreview-87070102. Even if we 
use a dataset without duplicate values, if the string index order used by R glm 
differs from the one used by RFormula, we still can't get the same 
results, because it looks like R glm doesn't follow frequency/alphabetical order.
    
    For example, I've tried the dataset Puromycin:
    
    ```R
    > training <- suppressWarnings(createDataFrame(Puromycin))
    > stats <- summary(spark.glm(training, conc ~ rate + state))
    > rStats <- summary(glm(conc ~ rate + state, data = Puromycin))
    > rStats$coefficients
                       Estimate  Std. Error   t value     Pr(>|t|)
    (Intercept)    -0.595461828 0.157462177 -3.781618 1.171709e-03
    rate            0.006642461 0.001022196  6.498228 2.464757e-06
    stateuntreated  0.136323828 0.095090605  1.433620 1.671302e-01
    > stats$coefficients
                      Estimate  Std. Error   t value     Pr(>|t|)
    (Intercept)   -0.459138000 0.130420375 -3.520447 2.150817e-03
    rate           0.006642461 0.001022196  6.498228 2.464757e-06
    state_treated -0.136323828 0.095090605 -1.433620 1.671302e-01
    ```
    
    You can see that the rate coefficient matches, but the intercept and the 
state coefficient do not: because the string index of the state column still 
differs between R glm and RFormula, we can't get the same results.
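    
    In the output above, the two state coefficients have the same magnitude 
with opposite signs, and the intercepts differ by exactly that amount 
(-0.595461828 + 0.136323828 = -0.459138000). That is consistent with the two 
fits using opposite reference levels for state. A minimal sketch in base R 
only (no Spark), assuming that `relevel` on the local data frame is a fair 
stand-in for the different indexing; the names `flipped`, `fit_default` and 
`fit_flipped` are just for illustration:
    
    ```R
    # Puromycin ships with base R (datasets package); state is a factor
    # with levels "treated" and "untreated".
    fit_default <- glm(conc ~ rate + state, data = Puromycin)

    # Make "untreated" the reference level, so the dummy column becomes
    # "statetreated" instead of "stateuntreated".
    flipped <- transform(Puromycin, state = relevel(state, ref = "untreated"))
    fit_flipped <- glm(conc ~ rate + state, data = flipped)

    coef(fit_default)
    coef(fit_flipped)

    # Same model, different parameterization: the new intercept is the old
    # intercept plus the old state coefficient, and the state coefficient
    # only changes sign.
    all.equal(unname(coef(fit_flipped)["(Intercept)"]),
              unname(coef(fit_default)["(Intercept)"] +
                     coef(fit_default)["stateuntreated"]))
    all.equal(unname(coef(fit_flipped)["statetreated"]),
              -unname(coef(fit_default)["stateuntreated"]))
    ```
    
    If this holds, the two fits describe the same model and only the encoding 
of state differs, so the test comparison would need to account for that.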
    
    A workaround is to use a dataset that doesn't need string indexing. 
What do you think? @felixcheung


