Hi all,

Maybe I'm missing something, but from what was discussed here I've gathered that the current MLlib implementation returns exactly the same model whether standardization is turned on or off.

Consider the R script below, which trains two penalized logistic regression models (with glmnet), with and without standardization. The resulting models are clearly different.

By the way, if penalization is turned off, the models are exactly the same.

Therefore, the current MLlib implementation doesn't follow glmnet. So, does that make it a bug?

library(glmnet)
library(e1071)

set.seed(13)

# generate synthetic data: two perfectly correlated features on very
# different scales
X = cbind(-500:500, (-500:500)*1000)/100000

# draw binary labels with success probabilities sigmoid(X %*% c(1, 1))
y = sigmoid(X %*% c(1, 1))
y = rbinom(length(y), 1, y)

# define two testing points
xTest = rbind(c(-10, -10), c(-20, -20))/1000

# train two models: with and without standardization
lambda = 0.01

model = glmnet(X, y, family="binomial", standardize=TRUE, lambda=lambda)
print(predict(model, xTest, type="link"))

model = glmnet(X, y, family="binomial", standardize=FALSE, lambda=lambda)
print(predict(model, xTest, type="link"))

Best,

Valeriy.


On 04/25/2018 12:32 AM, DB Tsai wrote:
As I’m one of the original authors, let me chime in for some comments.

Without standardization, LBFGS will be unstable. For example, if a feature is multiplied by 10, then the corresponding coefficient should be divided by 10 to make the same prediction. But without standardization, LBFGS can converge to a different solution due to numerical instability.

TL;DR: this can be implemented either in the optimizer or in the trainer. We chose to implement it in the trainer because the LBFGS optimizer in Breeze suffers from this issue. As a user, you don't need to worry much, even if you have one-hot encoded features, and the result should match R.
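The scale invariance described above can be sketched numerically. This is a hypothetical illustration (not MLlib code): multiplying a feature by 10 while dividing its coefficient by 10 leaves the logistic model's predictions unchanged, which is why an unpenalized fit is indifferent to feature scale.

```python
import numpy as np

# Illustration: logistic predictions are unchanged if a feature is
# multiplied by 10 while its coefficient is divided by 10.
rng = np.random.default_rng(13)
X = rng.normal(size=(100, 2))
w = np.array([1.0, -2.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p_original = sigmoid(X @ w)

# Scale the first feature by 10 and the first coefficient by 1/10.
X_scaled = X.copy()
X_scaled[:, 0] *= 10.0
w_scaled = w.copy()
w_scaled[0] /= 10.0

p_scaled = sigmoid(X_scaled @ w_scaled)

print(np.allclose(p_original, p_scaled))  # True: predictions are identical
```

The optimizer, however, sees a very different (possibly ill-conditioned) loss surface in the two parameterizations, which is the numerical issue being worked around.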

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

On Apr 20, 2018, at 5:56 PM, Weichen Xu <weichen...@databricks.com <mailto:weichen...@databricks.com>> wrote:

Right. If the regularization term isn't zero, then enabling/disabling standardization will give different results. But if we compare results between R's glmnet and MLlib with the same parameters for regularization/standardization/..., then we should get the same result. If not, then maybe there's a bug. In that case, please paste your testing code and I can help fix it.
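The point above can be demonstrated with a minimal sketch. It uses ridge regression instead of logistic regression purely because ridge has a closed form; the setup (data, lambda values) is assumed for illustration. Fitting on standardized features and scaling the coefficients back matches the raw-feature fit only when the penalty is zero:

```python
import numpy as np

# Sketch: "fit standardized, then scale coefficients back" equals the
# raw-feature fit only when the penalty is zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([1.0, 100.0])  # very different scales
y = X @ np.array([0.5, 0.01]) + rng.normal(size=200)

# Center X and y so we can ignore the intercept.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
sigma = Xc.std(axis=0)
Xs = Xc / sigma  # standardized features

def ridge(A, b, lam):
    # Closed-form ridge solution: (A'A + lam*I)^-1 A'b
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

for lam in (0.0, 10.0):
    beta_raw = ridge(Xc, yc, lam)           # fit on raw features
    beta_back = ridge(Xs, yc, lam) / sigma  # fit standardized, scale back
    print(lam, np.allclose(beta_raw, beta_back))
# lam = 0.0 -> True (identical models); lam = 10.0 -> False (the penalty
# is applied on different scales, so the models differ)
```

With a nonzero penalty, the penalty term is measured on whichever scale the optimizer sees, so the two parameterizations solve genuinely different problems.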

On Sat, Apr 21, 2018 at 1:06 AM, Valeriy Avanesov <acop...@gmail.com <mailto:acop...@gmail.com>> wrote:

    Hi all.

    Filipp, do you use l1/l2/elastic-net penalization? I believe in
    this case standardization matters.

    Best,

    Valeriy.


    On 04/17/2018 11:40 AM, Weichen Xu wrote:
    Not a bug.

    When disabling standardization, MLlib LR will still
    standardize the features internally, but it will scale the
    coefficients back at the end (after training finishes). So it
    will get the same result as training without standardization.
    The purpose of this is to improve the rate of convergence. So
    the result should always be exactly the same as R's glmnet,
    whether standardization is enabled or disabled.

    Thanks!

    On Sat, Apr 14, 2018 at 2:21 AM, Yanbo Liang <yblia...@gmail.com
    <mailto:yblia...@gmail.com>> wrote:

        Hi Filipp,

        MLlib’s LR implementation handles standardization the same
        way as R’s glmnet.
        Actually you don’t need to care about the implementation
        detail: the coefficients are always returned on the
        original scale, so it should return the same result as
        other popular ML libraries.
        Could you point me to where glmnet doesn’t scale features?
        I suspect other issues caused your prediction quality
        drop. If you can share the code and data, I can help
        check it.

        Thanks
        Yanbo


        On Apr 8, 2018, at 1:09 PM, Filipp Zhinkin
        <filipp.zhin...@gmail.com
        <mailto:filipp.zhin...@gmail.com>> wrote:

        Hi all,

        While migrating from a custom LR implementation to MLlib's
        LR implementation, my colleagues noticed that prediction
        quality dropped (according to different business metrics).
        It turned out that this issue is caused by the feature
        standardization performed by MLlib's LR: regardless of the
        'standardization' option's value, all features are scaled
        during loss and gradient computation (as well as in a few
        other places):
https://github.com/apache/spark/blob/6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229

        According to comments in the code, standardization should
        be implemented the same way it is implemented in R's
        glmnet package. I've looked through the corresponding
        Fortran code, and it seems like glmnet doesn't scale
        features when standardization is disabled (but MLlib
        still does).

        Our models contain multiple one-hot encoded features, and
        scaling them is a pretty bad idea.
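A small sketch (with assumed, made-up data) of why scaling one-hot features is problematic: a rare binary indicator has a tiny standard deviation, so dividing by it blows the feature up, exaggerating its weight in any scale-sensitive computation such as a penalty term.

```python
import numpy as np

# Illustration: standardizing a rare one-hot column inflates its values.
onehot = np.zeros(1000)
onehot[:10] = 1.0  # a category present in 1% of the rows

std = onehot.std()
standardized = (onehot - onehot.mean()) / std

print(std)                 # ~0.0995: tiny standard deviation
print(standardized.max())  # ~9.95: the rare "1"s become huge values
```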

        Why does MLlib's LR always scale all features? From my POV
        it's a bug.

        Thanks in advance,
        Filipp.






