On 3/27/20 6:20 PM, Gael Varoquaux wrote:
Thanks for the link Andy. This is indeed very interesting!
On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
Regarding learners, the top 5 in both GH17 and GH19 are LogisticRegression,
MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in that
order).
Maybe the LinearRegression docstring should more strongly suggest using Ridge
with a small regularization in practice.
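For concreteness, a minimal sketch of that suggestion; the toy data and the
alpha value are made up purely for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(50, 2)
    X = np.hstack([X, X[:, :1]])  # third column duplicates the first -> collinear
    y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(50)

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=1e-6).fit(X, y)  # small alpha; the exact value is illustrative

    print(ols.coef_)    # coefficients of the duplicated features are not identifiable
    print(ridge.coef_)  # the small penalty picks a stable solution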
Yes! I actually wonder if we should not remove LinearRegression. It
frightens me a bit that so many people use it. The only time I've seen it
used in a scientific paper, it was a mistake and it shouldn't have been
used.
I seldom advocate for deprecating :).
People use sklearn for inference. I'm not sure we should deprecate this
use case even though it's not our primary motivation.
Also, there's an inconsistency here: LogisticRegression has an L2 penalty
by default (to the annoyance of some), while LinearRegression does not. We
have discussed the meaning of the different classes for linear models
several times; they are certainly not consistent (Ridge, Lasso, and
LinearRegression are three separate classes for the squared loss, while all
three variants live inside the single LogisticRegression class for the log
loss).
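Concretely, side by side (the solver choices here are just examples; if I
remember right, penalty="none" was added in 0.21):

    from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge

    # Squared loss: one class per penalty
    LinearRegression()   # no penalty
    Ridge(alpha=1.0)     # L2 penalty
    Lasso(alpha=1.0)     # L1 penalty

    # Log loss: one class, penalty selected by a parameter (L2 by default)
    LogisticRegression()                                  # penalty="l2", C=1.0
    LogisticRegression(penalty="l1", solver="liblinear")  # L1 needs a compatible solver
    LogisticRegression(penalty="none")                    # unpenalized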
I think for many, "use statsmodels" is not a satisfying answer.
I have seen people argue that linear regression or logistic regression
should throw an error on collinear data, and I think that's not in the
spirit of sklearn (even though we had this as a warning in discriminant
analysis until recently).
But we should probably signal this more clearly. Our documentation doesn't
really emphasize the prediction vs inference point enough, I think.
Btw, we could also make our linear regression more stable by using the
minimum norm solution via the SVD.
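A rough sketch of what that could look like (numpy only; np.linalg.lstsq
already does the same thing via its rcond cutoff):

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(50, 2)
    X = np.hstack([X, X.sum(axis=1, keepdims=True)])  # rank-deficient by construction
    y = rng.randn(50)

    # Economy SVD; singular values below a tolerance are treated as zero
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    tol = s.max() * max(X.shape) * np.finfo(s.dtype).eps
    s_inv = np.where(s > tol, 1.0 / s, 0.0)
    coef = Vt.T @ (s_inv * (U.T @ y))  # minimum-norm least-squares solution

    # Equivalent to: np.linalg.lstsq(X, y, rcond=None)[0]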