Right. If the regularization term isn't zero, then enabling/disabling standardization will give different results. But if we compare results between R's glmnet and MLlib with the same parameters for regularization/standardization/..., then we should get the same result. If not, there may be a bug. In that case, please paste your test code and I can help fix it.
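To see why a nonzero regularization term makes standardization matter, here is a minimal self-contained sketch (illustrative only, not MLlib code): closed-form ridge regression on a single centered feature, fit once on the raw feature and once on the standardized feature with the coefficient scaled back afterwards.

```python
def ridge_coef(x, y, lam):
    # Closed-form ridge solution for one centered feature:
    # beta = sum(x*y) / (sum(x*x) + lam)
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]   # already centered
y = [-4.2, -1.9, 0.1, 2.1, 3.9]

sigma = (sum(xi * xi for xi in x) / len(x)) ** 0.5
x_std = [xi / sigma for xi in x]

lam = 1.0
raw = ridge_coef(x, y, lam)                      # fit on the raw feature
scaled_back = ridge_coef(x_std, y, lam) / sigma  # fit standardized, undo scaling

# With lam > 0 the two disagree, because the penalty sees different scales.
# With lam = 0 they coincide exactly.
```

With `lam = 0.0` the two estimates are identical, which is the point made above: only a nonzero penalty makes the choice of scale observable in the final coefficients.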
On Sat, Apr 21, 2018 at 1:06 AM, Valeriy Avanesov <acop...@gmail.com> wrote:

> Hi all.
>
> Filipp, do you use l1/l2/elastic-net penalization? I believe in this case
> standardization matters.
>
> Best,
>
> Valeriy.
>
> On 04/17/2018 11:40 AM, Weichen Xu wrote:
>
> Not a bug.
>
> When standardization is disabled, MLlib LR will still standardize the
> features, but it will scale the coefficients back at the end (after
> training has finished), so it gets the same result as training without
> standardization. The purpose of this is to improve the rate of
> convergence. So the result should always be exactly the same as R's
> glmnet, whether standardization is enabled or disabled.
>
> Thanks!
>
> On Sat, Apr 14, 2018 at 2:21 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> Hi Filipp,
>>
>> MLlib's LR implementation handles standardization the same way as R's
>> glmnet.
>> Actually, you don't need to care about the implementation detail: the
>> coefficients are always returned on the original scale, so it should
>> return the same result as other popular ML libraries.
>> Could you point me to where glmnet doesn't scale features?
>> I suspect other issues caused the drop in your prediction quality. If
>> you can share the code and data, I can help check it.
>>
>> Thanks
>> Yanbo
>>
>>
>> On Apr 8, 2018, at 1:09 PM, Filipp Zhinkin <filipp.zhin...@gmail.com>
>> wrote:
>>
>> Hi all,
>>
>> While migrating from a custom LR implementation to MLlib's LR
>> implementation, my colleagues noticed that prediction quality dropped
>> (according to several business metrics).
>> It turned out that this issue is caused by the feature standardization
>> performed by MLlib's LR: regardless of the 'standardization' option's
>> value, all features are scaled during loss and gradient computation (as
>> well as in a few other places):
>> https://github.com/apache/spark/blob/6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
>>
>> According to comments in the code, standardization should be implemented
>> the same way it is implemented in R's glmnet package. I've looked through
>> the corresponding Fortran code, and it seems that glmnet doesn't scale
>> features when standardization is disabled (but MLlib still does).
>>
>> Our models contain multiple one-hot encoded features, and scaling them
>> is a pretty bad idea.
>>
>> Why does MLlib's LR always scale all features? From my POV it's a bug.
>>
>> Thanks in advance,
>> Filipp.
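The scale-back trick Weichen describes in the quoted thread (train in standardized space, then divide the coefficients by the feature scales) can be sketched as follows. This is a toy sketch with unregularized least squares on one centered feature, not MLlib's actual code; without a penalty the recovered coefficient is identical to training on the raw feature.

```python
def ols_coef(x, y):
    # Least-squares slope for one centered feature: sum(x*y) / sum(x*x)
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

x = [-3.0, -1.0, 0.0, 1.0, 3.0]   # centered feature
y = [-6.1, -2.0, 0.2, 1.9, 6.0]

sigma = (sum(xi * xi for xi in x) / len(x)) ** 0.5
beta_std = ols_coef([xi / sigma for xi in x], y)  # fit in standardized space
beta = beta_std / sigma                           # scale back afterwards

# beta matches ols_coef(x, y): internal standardization is invisible
# in the returned coefficient when there is no regularization term.
```

This is why internal standardization is harmless for the unpenalized loss; the disagreement reported in the thread can only arise once a regularization term is applied in the standardized space.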