On 3/27/20 6:20 PM, Gael Varoquaux wrote:
Thanks for the link, Andy. This is indeed very interesting!

On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
Regarding learners, the top 5 in both GH17 and GH19 are LogisticRegression,
MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in that
order).
Maybe the LinearRegression docstring should more strongly suggest using
Ridge with small regularization in practice.
Yes! I actually wonder if we should not remove LinearRegression. It
frightens me a bit that so many people use it. The only time I've seen it
used in a scientific paper, it was a mistake and it shouldn't have been
used.

I seldom advocate for deprecating :).


People use sklearn for inference. I'm not sure we should deprecate this use case, even though it's not our primary motivation.

Also, there's an inconsistency here: LogisticRegression has an L2 penalty by default (to the annoyance of some), while LinearRegression has none. We have discussed the meaning of the different linear-model classes several times, and they are certainly not consistent: Ridge, Lasso, and LinearRegression are three separate classes for the squared loss, while all three variants live in the single LogisticRegression class for the log loss.
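
To make the asymmetry concrete, roughly this is what the current API looks like (defaults as I understand them):

    from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression

    # Squared loss: each penalty is its own class.
    LinearRegression()    # no penalty
    Ridge(alpha=1.0)      # L2 penalty
    Lasso(alpha=1.0)      # L1 penalty

    # Log loss: one class, penalty chosen by a parameter, L2 on by default.
    LogisticRegression(penalty="l2", C=1.0)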

I think that to many people, "use statsmodels" is not a satisfying answer.

I have seen people argue that linear regression or logistic regression should throw an error on collinear data, and I think that's not in the spirit of sklearn (even though we had such a warning in discriminant analysis until recently).
But we should probably signal this more clearly.
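
For reference, a quick demo of the current silent behavior (as far as I can tell):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.randn(20, 2)
    X = np.hstack([X, X[:, :1]])   # third column duplicates the first -> perfectly collinear
    y = X @ np.array([1.0, 2.0, 3.0])
    LinearRegression().fit(X, y)   # fits without raising or warning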

Our documentation doesn't really emphasize the prediction vs inference point enough, I think.

Btw, we could also make our linear regression more stable by using the minimum norm solution via the SVD.
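
Something along these lines, i.e. what np.linalg.lstsq / pinv do internally (just a sketch, ignoring the intercept):

    import numpy as np

    def min_norm_lstsq(X, y, rcond=1e-10):
        # SVD-based least squares: among all w minimizing ||X w - y||_2,
        # return the one with the smallest ||w||_2. Singular values below
        # rcond * s_max are treated as zero, which is what gives stability
        # on (near-)collinear data.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        keep = s > rcond * s.max()
        s_inv = np.zeros_like(s)
        s_inv[keep] = 1.0 / s[keep]
        return Vt.T @ (s_inv * (U.T @ y))

This is equivalent to np.linalg.pinv(X) @ y.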
