Hi all.

Filipp, do you use L1/L2/elastic-net penalization? I believe standardization matters in that case.

Best,

Valeriy.


On 04/17/2018 11:40 AM, Weichen Xu wrote:
Not a bug.

When standardization is disabled, MLlib LR still standardizes the features internally, but it scales the coefficients back at the end (after training has finished), so it gets the same result as training without standardization. The purpose of this is to improve the rate of convergence. So the result should always match R's glmnet exactly, whether standardization is enabled or disabled.
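The "scale the coefficients back" step rests on a simple identity: a linear decision function learned on standardized features can be expressed exactly on the original scale. A minimal numpy sketch (an illustration of the algebra, not MLlib code; the coefficient values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3)) * np.array([10.0, 0.1, 1.0])  # mixed feature scales
mu, sigma = X.mean(axis=0), X.std(axis=0)

# Suppose the optimizer found these coefficients on standardized features.
w_std = np.array([0.5, -1.2, 0.3])
b_std = 0.7

# Rescale back to the original feature scale:
#   w_std . ((x - mu) / sigma) + b_std == (w_std / sigma) . x + (b_std - w_std . (mu / sigma))
w_orig = w_std / sigma
b_orig = b_std - np.dot(w_std, mu / sigma)

# The margins (log-odds) are identical either way.
margin_std = ((X - mu) / sigma) @ w_std + b_std
margin_orig = X @ w_orig + b_orig
assert np.allclose(margin_std, margin_orig)
```

This is why, absent regularization, training on standardized features and rescaling afterwards returns the same model on the original scale.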

Thanks!

On Sat, Apr 14, 2018 at 2:21 AM, Yanbo Liang <yblia...@gmail.com <mailto:yblia...@gmail.com>> wrote:

    Hi Filipp,

    MLlib’s LR implementation handles standardization the same way
    as R’s glmnet.
    Actually, you don’t need to worry about the implementation
    detail, as the coefficients are always returned on the original
    scale, so it should return the same result as other popular ML
    libraries.
    Could you point me where glmnet doesn’t scale features?
    I suspect another issue caused your prediction quality to drop.
    If you can share the code and data, I can help check it.

    Thanks
    Yanbo


    On Apr 8, 2018, at 1:09 PM, Filipp Zhinkin
    <filipp.zhin...@gmail.com <mailto:filipp.zhin...@gmail.com>> wrote:

    Hi all,

    While migrating from a custom LR implementation to MLlib's LR
    implementation, my colleagues noticed that prediction quality
    dropped (according to several business metrics).
    It turned out that the issue is caused by the feature
    standardization performed by MLlib's LR: regardless of the
    'standardization' option's value, all features are scaled during
    loss and gradient computation (as well as in a few other places):
    
https://github.com/apache/spark/blob/6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
    
<https://github.com/apache/spark/blob/6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229>

    According to comments in the code, standardization should be
    implemented the same way as in R's glmnet package. I've looked
    through the corresponding Fortran code, and it seems that glmnet
    doesn't scale features when standardization is disabled (but
    MLlib still does).

    Our models contain multiple one-hot encoded features, and
    scaling them is a pretty bad idea.
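To see why scaling one-hot features is problematic, consider a rare binary indicator: dividing by its (small) standard deviation inflates its nonzero entries, which changes how strongly a regularization penalty bites on that coefficient. A small numpy illustration (hypothetical data, not MLlib code):

```python
import numpy as np

# One-hot indicator that is active for 1% of rows.
onehot = np.zeros(1000)
onehot[:10] = 1.0

std = onehot.std()  # small, since the feature is mostly zero
scaled = (onehot - onehot.mean()) / std

# Nonzero entries inflate to roughly 10x their original magnitude.
print(scaled.max())
```

Under L1/L2 penalization, this implicit rescaling effectively changes the penalty applied to rare indicator features relative to dense ones.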

    Why does MLlib's LR always scale all features? From my POV it's a bug.

    Thanks in advance,
    Filipp.



