He doesn't only talk about black-box vs. statistical modeling; he also talks about model-based vs. prediction-based validation. He says that if you validate predictions, you don't (necessarily) need to worry about model misspecification.

A linear regression model can be misspecified, and it can be overfit. Just fitting the model will not tell you whether either of these is the case. Because the model is simple and well understood, there are several ways to check for misspecification and overfitting. A train-test split doesn't exactly tell you whether the model is misspecified (the errors could be non-normal and the predictions could still be good),
but it gives you an idea of whether the model is "useful".
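To illustrate the distinction, here is a minimal sketch (my own example, not from the thread) of both kinds of check on a deliberately misspecified linear regression: the errors are skewed (exponential), so a model-based normality check fails, yet the held-out R^2 is still high, i.e. the model is "useful" for prediction.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data with skewed (non-normal) errors: the usual Gaussian-error
# assumption of linear regression is violated, but the mean is still linear.
X = rng.uniform(-3, 3, size=(500, 1))
y = 2.0 * X[:, 0] + (rng.exponential(size=500) - 1.0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Prediction-based check: R^2 on held-out data says whether the model is useful.
print("test R^2:", model.score(X_test, y_test))

# Model-based check: test the training residuals for normality (Shapiro-Wilk).
# A tiny p-value flags the misspecified error distribution, even though
# prediction is good.
residuals = y_train - model.predict(X_train)
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)
```

Both checks answer different questions: the first whether predictions generalize, the second whether the model's distributional assumptions hold.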

Basically: you need to validate whatever you did. There are model-based approaches and there are prediction-based approaches. Prediction-based approaches are always applicable; model-based approaches are usually more limited and harder to carry out (but if you find a good model, you have a model of the process, which is great!). Either way, you need to pick at least one of the two.


On 6/12/19 2:36 PM, C W wrote:
Thank you both for the paper references.

@ Andreas,
What is your take? And what are you implying?

The Breiman (2001) paper contrasts the black-box and statistical approaches. I call them black box vs. open box. He advocates the black-box approach in the paper.
Black box:
y <--- nature <--- x

Open box:
y <--- linear regression <---- x

Decision trees and neural nets are black-box models. They require large amounts of data to train, and they skip the part where one tries to understand nature.

Because it is a black box, you can't open it up to see what's inside. Linear regression is a very simple model that you can use to approximate nature, but the key thing is that you need to know how the data are generated.

@ Brown,
I know nothing about molecular modeling. The "Beware of q2!" paper you linked raises some interesting points; as far as I can see, in sklearn linear regression, score is R^2.
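For what it's worth, a quick sketch (my own, not from the paper) confirming that a scikit-learn regressor's `score` is R^2, so calling it on held-out data gives the cross-validated statistic the "Beware of q2!" paper calls q^2:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# score() is R^2 on whatever data you pass; on the *test* set it is
# computed from held-out predictions (a q^2-style statistic).
assert np.isclose(model.score(X_test, y_test),
                  r2_score(y_test, model.predict(X_test)))
```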

On Wed, Jun 5, 2019 at 9:11 AM Andreas Mueller <t3k...@gmail.com <mailto:t3k...@gmail.com>> wrote:


    On 6/4/19 8:44 PM, C W wrote:
    > Thank you all for the replies.
    >
    > I agree that prediction accuracy is great for evaluating
    black-box ML
    > models. Especially advanced models like neural networks, or
    > not-so-black models like LASSO, because they are NP-hard to solve.
    >
    > Linear regression is not a black-box. I view prediction accuracy
    as an
    > overkill on interpretable models. Especially when you can use
    > R-squared, coefficient significance, etc.
    >
    > Prediction accuracy also does not tell you which feature is
    important.
    >
    > What do you guys think? Thank you!
    >
    Did you read the paper that I sent? ;)
    _______________________________________________
    scikit-learn mailing list
    scikit-learn@python.org <mailto:scikit-learn@python.org>
    https://mail.python.org/mailman/listinfo/scikit-learn


_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
