Another interesting discussion coming out of CMU.

-tj

============================================
Tom Johnson
Institute for Analytic Journalism   --     Santa Fe, NM USA
505.577.6482(c)                                    505.473.9646(h)
Twitter: jtjohnson
slideshare.net/jtjohnson/presentations
 http://www.jtjohnson.com                   [email protected]
============================================


---------- Forwarded message ----------
From: Blogtrottr <[email protected]>
Date: Thu, Aug 28, 2014 at 1:05 PM
Subject: Civil Statistician: “Statistical Modeling: The Two Cultures,”
Breiman
To: [email protected]


Civil Statistician: Stats, datavis, edu, brains, etc.

“Statistical Modeling: The Two Cultures,” Breiman
<http://civilstat.com/?p=1713&utm_source=rss&utm_medium=rss&utm_campaign=statistical-modeling-the-two-cultures-breiman>
Aug 28th 2014, 18:22, by civilstat

One highlight of my fall semester is going to be a statistics journal club
led by CMU’s Ryan Tibshirani <http://www.stat.cmu.edu/~ryantibs/> together
with his dad Rob Tibshirani <http://statweb.stanford.edu/~tibs/> (here on
sabbatical from Stanford). The journal club will focus on “Hot Ideas in
Statistics <http://www.stat.cmu.edu/~ryantibs/journalclub/>”: some classic
papers that aren’t covered in standard courses, and some newer papers on
hot or developing areas. I’m hoping to find time to blog about several of
the papers we discuss.

The first paper was Leo Breiman’s “Statistical Modeling: The Two Cultures”
<http://www.stat.cmu.edu/~ryantibs/journalclub/breiman_2001.pdf> (2001)
with discussion and rejoinder. This is a very readable, high-level paper
about the culture of statistical education and practice, rather than
about technical details. I strongly encourage you to read it yourself.

Breiman’s article is quite provocative, encouraging statisticians
to downgrade the role of traditional mainstream statistics in favor of a
more machine-learning approach. Breiman calls the two approaches “data
modeling” and “algorithmic modeling”:

   - Data modeling assumes a stochastic model for where the data came
   from: what is the distribution for the data or the random noise, and how do
   you imagine it relates to predictor variables? Then you estimate and
   interpret the model parameters. Breiman claims that common practice is to
   validate your model by goodness-of-fit tests and residual analysis.
   - Algorithmic modeling assumes almost nothing about the data, except
   that it’s usually i.i.d. from the population you want to learn about. You
   don’t start with any statistical distributions or interpretable models;
   just build a “black box” algorithm, like random forests or neural nets, and
   evaluate performance by prediction accuracy (on withheld test data, or by
   cross-validation). The contrast is sketched in code right after this list.
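
Here is a minimal sketch of that contrast on simulated data (my own toy
example, not one of Breiman’s), using statsmodels for the data-modeling side
and scikit-learn for the algorithmic side:

    # Toy contrast between the two cultures on one simulated binary outcome.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))    # three predictors
    p = 1 / (1 + np.exp(-(0.5 + 1.0 * X[:, 0] - 0.8 * X[:, 1])))
    y = rng.binomial(1, p)           # binary outcome

    # Data modeling: posit a logistic model, estimate and interpret parameters.
    fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    print(fit.params)      # interpretable coefficients (log-odds scale)
    print(fit.conf_int())  # confidence intervals for the effects

    # Algorithmic modeling: a black box judged purely by predictive accuracy.
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    print(cross_val_score(rf, X, y, cv=5).mean())   # cross-validated accuracy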

I absolutely agree that traditional statistics focuses on the former over
the latter, and also that the latter has a lot to offer and should be a
welcome addition to any statistician’s toolbox. But Breiman’s tone is
pretty harsh regarding “data modeling,” apart from a few placating remarks
at the end. He uses a few straw man arguments, explaining how algorithmic
modeling beats poorly-done traditional statistics. (For instance, about
overreliance on 5% significance of regression coefficients,
he says “Nowadays, I think most statisticians will agree that this is a
suspect way to arrive at conclusions”—but he is still presenting this
“suspect way” as the standard that “most statisticians” use. So which is
it? Is the majority wrong or right? If by “statisticians” he actually means
“psychologists who took one stats class,” then this calls for a completely
different discussion about education and service courses.) Meanwhile,
Breiman neglects some important benefits of well-done data modeling.

A couple of the discussants (David Cox and Brad Efron) defend the value of
data modeling. (Efron has a great way to rephrase significance tests as
prediction problems: “In our sample of 20 patients drug A outperformed drug
B; would this still be true if we went on to test all possible patients?”)
Another discussant (Bruce Hoadley) shares some examples of early
algorithmic culture from the credit scoring industry, including the
importance of interpretability: “Sometimes we can’t implement them until
the lawyers and regulators approve. And that requires super
interpretability.” The final discussant (Emanuel Parzen) encourages us to
see many cultures besides Breiman’s two: Parzen mentions maximum entropy
methods, robust methods, Bayesian methods, and quantile methods, while I
would add sampling theory as another underappreciated distinct statistical
paradigm.

As for myself, I agree with many of Breiman’s points, especially that
“algorithmic modeling” should be added to our standard applied toolbox and
also become a bigger topic of theoretical study. But I don’t think “data
modeling” is as bad as he implies.

Breiman’s preferred approach is strongly focused on pure prediction
problems: Based on today’s weather, what will the ozone levels be tomorrow?
Can you train a mass spectrometer to predict whether a new unknown compound
contains chlorine? However, there are many scientific problems where the
question is about understanding, not really about prediction. Even if you
can never get really good predictions of who will experience liver
failure and who won’t, you still want to know the approximate effects of
various behaviors on your chance of liver failure. Breiman dismisses the
(nicely interpretable) logistic regression for this problem, suggests a
random forest instead, and shows a nifty way of estimating the relative
“importance” (whatever that means) of each predictor variable. But saying
“variable 12 is more important than variable 10” seems kind of pointless.
What you want to know is “If you increase your exposure to carcinogen X by
Y units, your risk of disease Z will double,” which is not as easy to
extract from a random forest. What’s more, a data model will give you
confidence intervals for those effects, which can be quite useful despite
their weaknesses. Most algorithmic models seem to ignore the concept of
confidence intervals for effect sizes entirely.
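
To make that concrete, here is a small sketch on simulated data (variable
names and effect sizes are my own invention): the forest’s importance scores
only rank the variables, while the logistic fit translates into a statement
about doubling the odds, with a confidence interval attached:

    # Simulated exposure/disease data; names and effects are hypothetical.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    exposure = rng.normal(size=1000)
    age = rng.normal(size=1000)
    eta = -1.0 + 0.7 * exposure + 0.2 * age
    disease = rng.binomial(1, 1 / (1 + np.exp(-eta)))
    X = np.column_stack([exposure, age])

    # Algorithmic answer: a ranking of variables, and nothing more.
    rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, disease)
    print(rf.feature_importances_)   # exposure "matters more" than age

    # Data-modeling answer: an effect size with uncertainty attached.
    fit = sm.Logit(disease, sm.add_constant(X)).fit(disp=0)
    beta = fit.params[1]             # log-odds per unit of exposure
    print(np.exp(beta))              # odds ratio for one unit of exposure
    print(np.log(2) / beta)          # units of exposure that double the odds
    print(fit.conf_int()[1])         # confidence interval for the effect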

Furthermore, there are statistical problems where you cannot really do
prediction in Breiman’s sense. In my work on small area estimation at the
Census Bureau, we often had trouble finding good ways to validate our
models, because simple cross-validation or withholding a test set won’t
work. When your goal is to provide poverty estimates for each of the 50
US states, you can’t just drop some of the states and cross-validate: the
states are not really exchangeable. And you can’t just get more states or
pretend that these 50 are a sample from a larger set of possible states: we
really do care about these 50. Sure, you can imagine various ways to get
around this, including evaluating prediction accuracy on synthetic data (as
we started to do
<http://civilstat.com/portfolio/Wieczorek_ArtificialPop_JSM2013.pdf>). But
my point is that it’s not trivial and you can’t treat everything as a
standard prediction problem with i.i.d. observations.
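
A toy version of the problem (my own sketch, far simpler than real small
area models): with grouped data, record-level cross-validation rewards
memorizing the groups, while holding out whole groups, as when predicting
for genuinely new areas, tells a different story.

    # Grouped data: naive K-fold CV vs. holding out whole groups.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold, GroupKFold, cross_val_score

    rng = np.random.default_rng(2)
    n_groups, per = 50, 20
    groups = np.repeat(np.arange(n_groups), per)
    x = rng.normal(size=n_groups * per)
    y = (x + rng.normal(scale=2.0, size=n_groups)[groups]
           + rng.normal(scale=0.5, size=n_groups * per))
    X = np.column_stack([x, groups])   # group label available as a feature

    rf = RandomForestRegressor(n_estimators=100, random_state=2)
    # Records from each group land in both train and test: looks great.
    print(cross_val_score(rf, X, y, cv=KFold(5, shuffle=True, random_state=2)).mean())
    # Whole groups held out: the group effects of unseen groups are unpredictable.
    print(cross_val_score(rf, X, y, groups=groups, cv=GroupKFold(5)).mean())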

That leads me to another concern, both in Breiman’s paper and in the
Machine Learning classes I’ve taken here at CMU. The “lack of a generative
data model” basically means that you’re assuming your training data are
taken i.i.d. (or as an SRS, a simple random sample) from the population you
want to learn about. Firstly, that IS a kind of generative data model. I’ll
treat this as an admission from Breiman that we’ve established you do need
a data model; now we’re just quibbling over its extent
<http://quoteinvestigator.com/2012/03/07/haggling/>. :) But
secondly, what *do* Machine Learning people do when the data are not
i.i.d.? If your training and test data aren’t representative of the broader
population, a simple prediction accuracy rate is meaningless. There must be
some research into this, but I’ve hardly seen any. For instance, I still
know of only one paper (Toth & Eltinge
<http://www.bls.gov/osmr/pdf/st100010.pdf>, 2011) on how to do regression
trees when your data come from a complex sample survey.
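
As a toy illustration of the danger (my own sketch, not from the Toth &
Eltinge paper): train and test on a biased sample, and the test-set accuracy
can say little about the population you actually care about.

    # A biased sample makes held-out accuracy unrepresentative.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(3)
    X_pop = rng.normal(size=(20000, 2))
    y_pop = (X_pop.sum(axis=1) + rng.normal(scale=0.5, size=20000) > 0).astype(int)

    # Biased inclusion: units with large x1 are far more likely to be sampled.
    keep = rng.random(20000) < 1 / (1 + np.exp(-3 * X_pop[:, 0]))

    X_tr, X_te, y_tr, y_te = train_test_split(X_pop[keep], y_pop[keep], random_state=3)
    rf = RandomForestClassifier(n_estimators=100, random_state=3).fit(X_tr, y_tr)
    print(rf.score(X_te, y_te))                  # accuracy on the biased test set
    print(rf.score(X_pop[~keep], y_pop[~keep]))  # accuracy on the unsampled rest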

In class I mentioned that our program has no classes covering nontrivial
experimental design or survey sampling at the PhD level. Surely there would
be interest in learning how to do this well for statistics and machine
learning. My classmate Alex asked if I’m volunteering to teach it. :)
Maybe not a bad idea someday?

In our class discussion, people also pointed out that many of the
“algorithmic” models can be motivated by a statistical model, just like you
can treat many data modeling methods as pure algorithms. It seems clear
that it’s always good to know what implicit model is behind your algorithm.
Then at least you have some hope of checking your model assumptions, even
if it’s imperfect. In general, I think there is still a need to develop
better model diagnostics for both data and algorithmic models. I don’t mean
more yes-or-no goodness-of-fit tests, but better ways to decide *how* your
model is weak and can be improved. Breiman cites Bill Cleveland admitting
that residual analysis doesn’t help much beyond four or five dimensions,
but that just cries out for more research. Breiman’s examples remind you of
the importance of checking for multicollinearity before you make
interpretations, but that is true of algorithmic modeling too.
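
As one concrete example of such a routine diagnostic, here is a
variance-inflation-factor check on simulated data (my own sketch), a
standard way to detect multicollinearity before interpreting coefficients:

    # Variance inflation factors flag near-collinear predictors.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(4)
    x1 = rng.normal(size=300)
    x2 = x1 + rng.normal(scale=0.1, size=300)   # nearly a copy of x1
    x3 = rng.normal(size=300)
    X = sm.add_constant(np.column_stack([x1, x2, x3]))

    # VIFs far above 10 for x1 and x2 warn that their individual
    # coefficients are not separately interpretable.
    for i in range(1, X.shape[1]):
        print(f"VIF for predictor {i}: {variance_inflation_factor(X, i):.1f}")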

Yes, there are gaps in traditional statistics culture’s approach, some of
which algorithmic modeling or machine learning can help to fill. There are
even bigger gaps in our ability to train non-experts to use statistical
models and procedures appropriately. But I doubt that non-experts will make
much better use of random forests or neural nets, even if those methods
could conceivably give better prediction performance (where that concept
even applies). In the end, Breiman makes many valid points, but he does not
convince me to dismiss distributional assumptions and traditional
statistics as a dead-end approach.
