Also, it looks like you need to scale down the regularization parameter for Linear Regression by 1/(2n), since the loss function is scaled by 1/(2n) (refer to the API documentation for Linear Regression). I was able to get close enough results after this modification.
--spark-ml code--
val linearModel = new
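To see why the 1/(2n) factor matters, here is a quick plain-Python check (toy data, no Spark or sklearn needed; it assumes a sum-of-squares objective on one side and a 1/(2n)-scaled objective with an un-halved L2 term on the other, which is how the scaling is described above):

```python
# One-feature ridge regression, no intercept. Compare the minimizers of
#   L_sum(w)  = sum((y - w*x)^2) + alpha * w^2          (sum-scaled loss)
#   L_mean(w) = (1/(2n)) * sum((y - w*x)^2) + lam * w^2 (1/(2n)-scaled loss)
# With lam = alpha / (2*n), L_mean is exactly L_sum / (2n), so both
# objectives have the same minimizer.

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
alpha = 10.0            # penalty strength for the sum-scaled loss
lam = alpha / (2 * n)   # rescaled penalty for the 1/(2n)-scaled loss

sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

# Closed-form minimizers (set each derivative to zero and solve for w):
w_sum = sxy / (sxx + alpha)           # from the sum-scaled objective
w_mean = sxy / (sxx + 2 * n * lam)    # from the 1/(2n)-scaled objective

print(w_sum, w_mean)  # identical up to floating point
```

Whether the exact factor is 1/(2n) or 1/n in your case depends on how the library writes its L2 term (some put an extra 1/2 in front of the penalty), so check the objective in the API docs before picking the factor.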
[Edit] I got a few details wrong in my eagerness to reply:
1. Spark uses the corrected (sample) standard deviation, with n-1 in the denominator, while scikit-learn uses the uncorrected (population) one, with n.
2. You should scale down the regularization by the sum of the instance weights, in case you have a weight column. When there are no weights, this is just the number of rows, n.
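Point 1 above is easy to verify by hand. A plain-Python sketch of the two standard deviations (the `ddof` naming is borrowed from numpy; ddof=0 divides by n, ddof=1 by n-1):

```python
import math

def stddev(xs, ddof):
    """Standard deviation with divisor n - ddof (ddof=0: population, ddof=1: sample)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((v - mean) ** 2 for v in xs) / (n - ddof))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sd_population = stddev(data, ddof=0)  # divide by n, as in scikit-learn's scaling
sd_sample = stddev(data, ddof=1)      # divide by n - 1, as in Spark's scaling

print(sd_population, sd_sample)  # the sample stddev is larger by sqrt(n / (n - 1))
```

The two differ by a factor of sqrt(n / (n - 1)), which shrinks as n grows, so this discrepancy mostly matters for small datasets with standardization enabled.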
Hi Frank
Thanks for this question. I have been comparing logistic regression in sklearn with Spark MLlib as well, and your example code gave me a perfect way to compare what is going on in the two packages.
I looked at the source code of both. There are quite a few differences in how
the model
(this was also posted to stackoverflow on 03/10)
I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (the data is the same, the model type is the same, the regularization is the same).