He doesn't only talk about black-box vs. statistical modeling; he also talks about model-based vs. prediction-based validation. He says that if you validate predictions, you don't (necessarily) need to worry about model misspecification.

A linear regression model can be misspecified, and it can be overfit. Just fitting the model will not tell you whether either of these is the case. Because the model is simple and well understood, there are several ways to check for misspecification and overfitting. A train-test split doesn't exactly tell you whether the model is misspecified (the errors could be non-normal and the predictions could still be good),
but it gives you an idea of whether the model is "useful".
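To illustrate the distinction, here is a minimal sketch (my own example, not from the thread) of both kinds of check on a deliberately misspecified linear regression: the errors are skewed (exponential), so a model-based normality check fails, yet the held-out R^2 is still high, i.e. the model is "useful" for prediction.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data with skewed (non-normal) errors: the usual Gaussian-error
# assumption of linear regression is violated, but the mean is still linear.
X = rng.uniform(-3, 3, size=(500, 1))
y = 2.0 * X[:, 0] + (rng.exponential(size=500) - 1.0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Prediction-based check: R^2 on held-out data says whether the model is useful.
print("test R^2:", model.score(X_test, y_test))

# Model-based check: test the training residuals for normality (Shapiro-Wilk).
# A tiny p-value flags the misspecified error distribution, even though
# prediction is good.
residuals = y_train - model.predict(X_train)
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)
```

Both checks answer different questions: the first whether predictions generalize, the second whether the model's distributional assumptions hold.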

Basically: you need to validate whatever you did. There are model-based approaches and there are prediction-based approaches. Prediction-based approaches are always applicable; model-based approaches are usually more limited and harder to carry out (but if you find a good model, you have a model of the process, which is great!). Either way, you need to pick at least one of the two.


On 6/12/19 2:36 PM, C W wrote:
Thank you both for the paper references.

@ Andreas,
What is your take? And what are you implying?

The Breiman (2001) paper contrasts the black-box and statistical approaches. I call them black box vs. open box. He advocates the black-box approach in the paper.
Black box:
y <--- nature <--- x

Open box:
y <--- linear regression <---- x

Decision trees and neural nets are black-box models. They require large amounts of data to train, and they skip the part where one tries to understand nature.

Because it is a black box, you can't open it up to see what's inside. Linear regression is a very simple model that you can use to approximate nature, but the key thing is that you need to know how the data are generated.

@ Brown,
I know nothing about molecular modeling. The "Beware of q2!" paper you linked raises some interesting points; as far as I can see, in sklearn linear regression, score is R^2.
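For what it's worth, a quick sketch (my own, not from the paper) confirming that a scikit-learn regressor's `score` is R^2, so calling it on held-out data gives the cross-validated statistic the "Beware of q2!" paper calls q^2:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# score() is R^2 on whatever data you pass; on the *test* set it is
# computed from held-out predictions (a q^2-style statistic).
assert np.isclose(model.score(X_test, y_test),
                  r2_score(y_test, model.predict(X_test)))
```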

On Wed, Jun 5, 2019 at 9:11 AM Andreas Mueller <t3k...@gmail.com <mailto:t3k...@gmail.com>> wrote:


    On 6/4/19 8:44 PM, C W wrote:
    > Thank you all for the replies.
    >
    > I agree that prediction accuracy is great for evaluating
    black-box ML
    > models. Especially advanced models like neural networks, or
    > not-so-black models like LASSO, because they are NP-hard to solve.
    >
    > Linear regression is not a black-box. I view prediction accuracy
    as an
    > overkill on interpretable models. Especially when you can use
    > R-squared, coefficient significance, etc.
    >
    > Prediction accuracy also does not tell you which feature is
    important.
    >
    > What do you guys think? Thank you!
    >
    Did you read the paper that I sent? ;)
    _______________________________________________
    scikit-learn mailing list
    scikit-learn@python.org <mailto:scikit-learn@python.org>
    https://mail.python.org/mailman/listinfo/scikit-learn


_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
