Re: Differences between scikit-learn and Spark.ml for regression toy problem

2017-03-13 Thread Dhanesh Padmanabhan
It also looks like you need to scale down the regularization for Linear Regression by 1/(2n), since the loss function is scaled by 1/(2n) (refer to the API documentation for Linear Regression). I was able to get close enough results after this modification. --spark-ml code-- val linearModel = new

Re: Differences between scikit-learn and Spark.ml for regression toy problem

2017-03-13 Thread Dhanesh Padmanabhan
[Edit] I got a few details wrong in my eagerness to reply: 1. Spark uses the corrected standard deviation with sqrt(n-1), and scikit-learn uses the one with sqrt(n). 2. You should scale down the regularization by the sum of the weights, in case you have a column of weights. When there are no weights, it is
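
To illustrate point 1 above, here is a minimal sketch of the two standard deviation conventions, using only Python's standard library rather than Spark or scikit-learn themselves. The sample values are made up for illustration and do not come from the thread. `statistics.stdev` divides by n-1 (the corrected form) and `statistics.pstdev` divides by n, so features standardized under the two conventions differ by a factor of sqrt(n/(n-1)).

```python
import math
import statistics

# Illustrative data only (an assumption, not taken from the thread).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)

# Uncorrected (population) standard deviation: divides by n.
population_sd = statistics.pstdev(values)

# Corrected (sample) standard deviation: divides by n - 1.
sample_sd = statistics.stdev(values)

# The two conventions differ by a constant factor that depends only on n.
ratio = sample_sd / population_sd

print(f"population sd (divide by n):   {population_sd:.6f}")
print(f"sample sd (divide by n - 1):   {sample_sd:.6f}")
print(f"ratio: {ratio:.6f}, sqrt(n/(n-1)): {math.sqrt(n / (n - 1)):.6f}")
```

The factor shrinks toward 1 as n grows, so the mismatch is most visible on small toy datasets like the one discussed in this thread.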

Re: Differences between scikit-learn and Spark.ml for regression toy problem

2017-03-13 Thread Dhanesh Padmanabhan
Hi Frank, thanks for this question. I have been comparing logistic regression in sklearn with Spark MLlib as well. Your example code gave me a perfect way to compare what is going on in both packages. I looked at the source code of both. There are quite a few differences in how the model

Differences between scikit-learn and Spark.ml for regression toy problem

2017-03-12 Thread Frank Astier
(This was also posted to Stack Overflow on 03/10.) I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (data is the same, model type is the same, regularization is the