Hi Valeriy,

Let me make sure we are on the same page.

"the current mllib implementation returns exactly the same model whether
standardization is turned on or off."

This should be corrected to: "the current mllib implementation returns
exactly the same model whether standardization is turned on or off,
provided the regularization parameter is 0; otherwise, the two models are
expected to differ."

We expect:

1. R glmnet and Spark ML share the same behavior, given that all other
conditions are the same.

1.1 Following from 1, if the regularization parameter is not zero, Spark ML
outputs two different models depending on whether standardization is
turned on or off.

The easiest way to check 1.1 is to change setStandardization(false) to true
in a test with regularization != 0 and run the test again; the test is
expected to fail.

On Fri, Apr 27, 2018 at 3:08 PM, Valeriy Avanesov <acop...@gmail.com> wrote:

> Hi all,
>
> maybe I'm missing something, but from what was discussed here I've
> gathered that the current mllib implementation returns exactly the same
> model whether standardization is turned on or off.
>
> I suggest considering an R script (please see below) which trains two
> penalized logistic regression models (with glmnet), with and without
> standardization. The models are clearly different.
>
> BTW, if penalization is turned off, the models are exactly the same.
>
> Therefore, the current mllib implementation doesn't follow glmnet. So,
> does that make it a bug?
>
> library(glmnet)
> library(e1071)
>
> set.seed(13)
>
> # generate synthetic data
> X = cbind(-500:500, (-500:500)*1000)/100000
>
> y = sigmoid(X %*% c(1, 1))
> y = rbinom(y, 1, y)
>
> # define two testing points
> xTest = rbind(c(-10, -10), c(-20, -20))/1000
>
> # train two models: with and without standardization
> lambda = 0.01
>
> model = glmnet(X, y, family="binomial", standardize=TRUE, lambda=lambda)
> print(predict(model, xTest, type="link"))
>
> model = glmnet(X, y, family="binomial", standardize=FALSE, lambda=lambda)
> print(predict(model, xTest, type="link"))
>
> Best,
>
> Valeriy.
>
> On 04/25/2018 12:32 AM, DB Tsai wrote:
>
> As I’m one of the original authors, let me chime in with some comments.
>
> Without standardization, LBFGS will be unstable. For example, if a
> feature is multiplied by 10, then the corresponding coefficient should be
> divided by 10 to make the same prediction. But without standardization,
> LBFGS will converge to a different solution because of numerical
> instability.
>
> TLDR, this can be implemented in the optimizer or in the trainer. We
> chose to implement it in the trainer because the LBFGS optimizer in breeze
> suffers from this issue. As a user, you don’t need to care much even if you
> have one-hot encoded features, and the result should match R.
>
> DB Tsai | Siri Open Source Technologies [not a contribution] |
> Apple, Inc
>
> On Apr 20, 2018, at 5:56 PM, Weichen Xu <weichen...@databricks.com> wrote:
>
> Right. If the regularization term isn't zero, then enabling or disabling
> standardization will give different results.
> But when comparing results between R glmnet and mllib, if we set the same
> parameters for regularization/standardization/..., then we should get
> the same result. If not, then maybe there's a bug. In that case you can
> paste your testing code and I can help fix it.
>
> On Sat, Apr 21, 2018 at 1:06 AM, Valeriy Avanesov <acop...@gmail.com>
> wrote:
>
>> Hi all.
>>
>> Filipp, do you use l1/l2/elastic-net penalization?
>> I believe in this case standardization matters.
>>
>> Best,
>>
>> Valeriy.
>>
>> On 04/17/2018 11:40 AM, Weichen Xu wrote:
>>
>> Not a bug.
>>
>> When standardization is disabled, mllib LR will still standardize the
>> features, but it will scale the coefficients back at the end (after
>> training has finished). So it will get the same result as training
>> without standardization. The purpose is to improve the rate of
>> convergence. So the result should always be exactly the same as R's
>> glmnet, whether standardization is enabled or disabled.
>>
>> Thanks!
>>
>> On Sat, Apr 14, 2018 at 2:21 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>>
>>> Hi Filipp,
>>>
>>> MLlib’s LR implementation handles standardization the same way as R’s
>>> glmnet.
>>> Actually you don’t need to care about the implementation detail, as the
>>> coefficients are always returned on the original scale, so it should
>>> return the same result as other popular ML libraries.
>>> Could you point me to where glmnet doesn’t scale features?
>>> I suspect other issues caused your prediction quality to drop. If you can
>>> share the code and data, I can help check it.
>>>
>>> Thanks
>>> Yanbo
>>>
>>>
>>> On Apr 8, 2018, at 1:09 PM, Filipp Zhinkin <filipp.zhin...@gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> While migrating from a custom LR implementation to MLLib's LR
>>> implementation, my colleagues noticed that prediction quality dropped
>>> (according to different business metrics).
>>> It turned out that this issue is caused by the feature standardization
>>> performed by MLLib's LR: regardless of the 'standardization' option's
>>> value, all features are scaled during loss and gradient computation (as
>>> well as in a few other places):
>>> https://github.com/apache/spark/blob/6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
>>>
>>> According to comments in the code, standardization should be implemented
>>> the same way it was implemented in R's glmnet package. I've looked through
>>> the corresponding Fortran code, and it seems that glmnet doesn't scale
>>> features when you disable standardization (but MLLib still does).
>>>
>>> Our models contain multiple one-hot encoded features, and scaling them
>>> is a pretty bad idea.
>>>
>>> Why does MLLib's LR always scale all features? From my POV it's a bug.
>>>
>>> Thanks in advance,
>>> Filipp.
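
The point at the top of this thread (same model when regularization is 0, different models otherwise) can be checked without Spark or R. Below is a minimal sketch in plain Python. It is an illustration only, not MLlib's or glmnet's code. The helper names (fit_logistic_1d, fit_with_standardization) and the toy data are made up for this example, and the model is simplified to one feature, no intercept, and an L2 penalty. Scaling by the population standard deviation is also an assumption of the sketch.

```python
import math
import statistics


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_1d(xs, ys, lam, iters=50):
    """Newton's method for a one-feature, no-intercept logistic regression
    with mean log-loss plus the penalty (lam / 2) * w**2.
    Hypothetical helper for illustration only."""
    n = len(xs)
    w = 0.0
    for _ in range(iters):
        # First and second derivative of the penalized mean log-loss.
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / n + lam * w
        hess = sum(sigmoid(w * x) * (1.0 - sigmoid(w * x)) * x * x for x in xs) / n + lam
        w -= grad / hess
    return w


def fit_with_standardization(xs, ys, lam):
    """Fit on the scaled feature x / s, then map the coefficient back to the
    original scale. Using the population standard deviation for s is an
    assumption of this sketch."""
    s = statistics.pstdev(xs)
    w_scaled = fit_logistic_1d([x / s for x in xs], ys, lam)
    return w_scaled / s


# Toy, non-separable data (made up for the example).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w_plain_noreg = fit_logistic_1d(xs, ys, lam=0.0)
w_std_noreg = fit_with_standardization(xs, ys, lam=0.0)
w_plain_reg = fit_logistic_1d(xs, ys, lam=0.1)
w_std_reg = fit_with_standardization(xs, ys, lam=0.1)

print("lam = 0.0:", w_plain_noreg, w_std_noreg)  # agree up to rounding
print("lam = 0.1:", w_plain_reg, w_std_reg)      # differ
```

With lam = 0 the scaling cancels when the coefficient is mapped back, so both fits agree. With lam > 0 the penalty is applied to the coefficient of the scaled feature, which in this one-feature case amounts to a different effective penalty on the original scale, so the two fits differ.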