Hi Filipp,

MLlib’s LR implementation handles standardization the same way R’s glmnet does. 
Actually you don’t need to worry about this implementation detail, because the 
coefficients are always returned on the original scale, so it should return 
the same results as other popular ML libraries.
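For example, here is a quick, untested sketch (toy, made-up data, just for 
illustration) that should show the coefficients come back on the original scale 
either way; with regParam = 0 there is no penalty, so both settings should 
converge to roughly the same solution:

// Paste into spark-shell (spark is the pre-created SparkSession):
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Toy, non-separable data: one binary feature and one large-scale feature.
val train = Seq(
  (1.0, Vectors.dense(1.0, 1200.0)),
  (1.0, Vectors.dense(0.0, 900.0)),
  (1.0, Vectors.dense(1.0, 200.0)),
  (0.0, Vectors.dense(0.0, 1000.0)),
  (0.0, Vectors.dense(1.0, 450.0)),
  (0.0, Vectors.dense(0.0, 300.0))
).toDF("label", "features")

// No regularization, so the internal standardization should not change the optimum.
val withStd = new LogisticRegression().setStandardization(true).setRegParam(0.0).fit(train)
val noStd   = new LogisticRegression().setStandardization(false).setRegParam(0.0).fit(train)

// Both models report coefficients on the original feature scale.
println(s"standardization=true : ${withStd.coefficients} ${withStd.intercept}")
println(s"standardization=false: ${noStd.coefficients} ${noStd.intercept}")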
Could you point me to where glmnet doesn’t scale features? 
I suspect some other issue caused your prediction quality to drop. If you can 
share the code and data, I can help check it.

Thanks
Yanbo

> On Apr 8, 2018, at 1:09 PM, Filipp Zhinkin <filipp.zhin...@gmail.com> wrote:
> 
> Hi all,
> 
> While migrating from a custom LR implementation to MLlib's LR implementation, my 
> colleagues noticed that prediction quality dropped (according to several 
> business metrics).
> It turned out that this issue is caused by the feature standardization performed 
> by MLlib's LR: regardless of the 'standardization' option's value, all features are 
> scaled during loss and gradient computation (as well as in a few other places): 
> https://github.com/apache/spark/blob/6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
> 
> According to comments in the code, standardization should be implemented the 
> same way it is implemented in R's glmnet package. I've looked through the 
> corresponding Fortran code, and it seems like glmnet doesn't scale features when 
> you disable standardization (but MLlib still does).
> 
> Our models contain multiple one-hot encoded features, and scaling them is a 
> pretty bad idea.
> 
> Why does MLlib's LR always scale all features? From my POV it's a bug.
> 
> Thanks in advance,
> Filipp.
> 
