Thanks Tom.  This may prove useful for starting arguments at a meeting next 
week!  The people we're meeting with are staunch data modelers, though perhaps 
not in quite the sense used here.  They consistently hear our challenges as 
claims that their approaches are flawed, despite our rather loud advocacy for 
_all_ appropriate methods ... that any solution should flow from the 
requirements set by the problem.  The paper and the synoptic blog post might 
come in handy ... but I won't be holding my breath.


On 08/28/2014 01:42 PM, Tom Johnson wrote:
Another interesting discussion coming out of CMU.

---------- Forwarded message ----------
From: *Blogtrottr* <[email protected]>
Date: Thu, Aug 28, 2014 at 1:05 PM
Subject: Civil Statistician: “Statistical Modeling: The Two Cultures,” Breiman


“Statistical Modeling: The Two Cultures,” Breiman 
<http://civilstat.com/?p=1713>
Aug 28th 2014, 18:22, by civilstat

One highlight of my fall semester is going to be a statistics journal club led by CMU’s Ryan 
Tibshirani <http://www.stat.cmu.edu/~ryantibs/> together with his dad Rob Tibshirani 
<http://statweb.stanford.edu/~tibs/> (here on sabbatical from Stanford). The journal club 
will focus on “Hot Ideas in Statistics <http://www.stat.cmu.edu/~ryantibs/journalclub/>”: 
some classic papers that aren’t covered in standard courses, and some newer papers on hot or 
developing areas. I’m hoping to find time to blog about several of the papers we discuss.

The first paper was Leo Breiman’s “Statistical Modeling: The Two Cultures” 
<http://www.stat.cmu.edu/~ryantibs/journalclub/breiman_2001.pdf> (2001) with 
discussion and rejoinder. This is a very readable, high-level paper about the culture 
of statistical education and practice, rather than about technical details. I 
strongly encourage you to read it yourself.

Breiman’s article is quite provocative, encouraging statisticians to downgrade 
the role of traditional mainstream statistics in favor of a more 
machine-learning approach. Breiman calls the two approaches “data modeling” and 
“algorithmic modeling”:

  * Data modeling assumes a stochastic model for where the data came from: what 
is the distribution for the data or the random noise, and how do you imagine it 
relates to predictor variables? Then you estimate and interpret the model 
parameters. Breiman claims that common practice is to validate your model by 
goodness-of-fit tests and residual analysis.
  * Algorithmic modeling assumes almost nothing about the data, except that 
it’s usually i.i.d. from the population you want to learn about. You don’t 
start with any statistical distributions or interpretable models; just build a 
“black box” algorithm, like random forests or neural nets, and evaluate 
performance by prediction accuracy (on withheld test data, or by 
cross-validation). Both workflows are sketched in code below.
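
To make the contrast concrete, here is a minimal sketch in Python. It is not 
from Breiman’s paper: the data are invented, and scikit-learn and statsmodels 
are assumed purely for illustration.

    # Contrast of the two cultures on the same synthetic data.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    # True generative model: logistic in the first two predictors only.
    p = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 1.2 * X[:, 1])))
    y = rng.binomial(1, p)

    # Data modeling: posit a stochastic model, estimate and interpret parameters.
    logit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    print(logit.summary())  # coefficients, standard errors, p-values

    # Algorithmic modeling: a black box judged purely on held-out prediction.
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    acc = cross_val_score(rf, X, y, cv=5)  # default scoring: accuracy
    print("CV accuracy: %.3f +/- %.3f" % (acc.mean(), acc.std()))

The point of the sketch is only that the two workflows ask different 
questions of the same data: one reads off parameters, the other reads off a 
held-out error rate.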

I absolutely agree that traditional statistics focuses on the former over the 
latter, and also that the latter has a lot to offer and should be a welcome 
addition to any statistician’s toolbox. But Breiman’s tone is pretty harsh 
regarding “data modeling,” apart from a few placating remarks at the end. He 
uses a few straw man arguments, explaining how algorithmic modeling beats 
poorly-done traditional statistics. (For instance, about overreliance on 5% 
significance of regression coefficients, he says “Nowadays, I think most 
statisticians will agree that this is a suspect way to arrive at 
conclusions”—but he is still presenting this “suspect way” as the standard that 
“most statisticians” use. So which is it? Is the majority wrong or right? If by 
“statisticians” he actually means “psychologists who took one stats class,” 
then this calls for a completely different discussion about education and 
service courses.) Meanwhile, Breiman neglects some important benefits of 
well-done data modeling.

A couple of the discussants (David Cox and Brad Efron) defend the value of data 
modeling. (Efron has a great way to rephrase significance tests as prediction 
problems: “In our sample of 20 patients drug A outperformed drug B; would this 
still be true if we went on to test all possible patients?”) Another discussant 
(Bruce Hoadley) shares some examples of early algorithmic culture from the 
credit scoring industry, including the importance of interpretability: 
“Sometimes we can’t implement them until the lawyers and regulators approve. 
And that requires super interpretability.” The final discussant (Emanuel 
Parzen) encourages us to see many cultures besides Breiman’s two: Parzen 
mentions maximum entropy methods, robust methods, Bayesian methods, and 
quantile methods, while I would add sampling theory as another underappreciated 
distinct statistical paradigm.

As for myself, I agree with many of Breiman’s points, especially that 
“algorithmic modeling” should be added to our standard applied toolbox and also 
become a bigger topic of theoretical study. But I don’t think “data modeling” 
is as bad as he implies.

Breiman’s preferred approach is strongly focused on pure prediction problems: 
Based on today’s weather, what will the ozone levels be tomorrow? Can you train 
a mass spectrometer to predict whether a new unknown compound contains 
chlorine? However, there are many scientific problems where the question is 
about understanding, not really about prediction. Even if you can never get 
really good predictions of who will experience liver failure and who won’t, you 
still want to know the approximate effects of various behaviors on your chance 
of liver failure. Breiman dismisses the (nicely interpretable) logistic 
regression for this problem, suggests a random forest instead, and shows a 
nifty way of estimating the relative “importance” (whatever that means) of each 
predictor variable. But saying “variable 12 is more important than variable 10” 
seems kind of pointless. What you want to know is “If you increase your 
exposure to carcinogen X by Y units, your risk of disease Z will 
double,” which is not as easy to extract from a random forest. Even more so, a 
data model will give you confidence intervals, which can be quite useful 
despite their weaknesses. Most algorithmic models seem to entirely ignore the 
concept of confidence intervals for effect sizes.
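
To illustrate what I mean (my own sketch, with invented numbers, assuming 
statsmodels): a logistic regression hands you an interpretable effect size 
with a confidence interval almost for free, which no importance score does.

    # Effect size with a confidence interval from a data model.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    exposure = rng.normal(size=1000)   # hypothetical carcinogen exposure
    other = rng.normal(size=1000)      # some other predictor
    logit_p = -2.0 + 0.7 * exposure + 0.1 * other
    y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

    X = sm.add_constant(np.column_stack([exposure, other]))
    fit = sm.Logit(y, X).fit(disp=0)

    # Odds ratio per one-unit increase in exposure, with a 95% CI:
    lo, hi = fit.conf_int()[1]
    print("odds ratio %.2f, 95%% CI (%.2f, %.2f)"
          % (np.exp(fit.params[1]), np.exp(lo), np.exp(hi)))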

Furthermore, there are statistical problems where you cannot really do prediction in 
Breiman’s sense. In my work on small area estimation at the Census Bureau, we often 
had trouble finding good ways to validate our models, because simple 
cross-validation or withholding a test set won’t work. When your goal is to provide poverty 
estimates for each of the 50 US states, you can’t just drop some of the states and 
cross-validate: the states are not really exchangeable. And you can’t just get more 
states or pretend that these 50 are a sample from a larger set of possible states: we 
really do care about these 50. Sure, you can imagine various ways to get around this, 
including evaluating prediction accuracy on synthetic data (as we started to do 
<http://civilstat.com/portfolio/Wieczorek_ArtificialPop_JSM2013.pdf>). But my 
point is that it’s not trivial and you can’t treat everything as a standard 
prediction problem with i.i.d. observations.
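
Here is a toy version of that synthetic-data idea (every quantity below is 
invented): build a population where the true state-level rates are known by 
construction, then score estimators against that truth instead of 
cross-validating over states.

    # Evaluate estimators on a synthetic population with known truth.
    import numpy as np

    rng = np.random.default_rng(2)
    n_states = 50
    true_rate = rng.uniform(0.05, 0.25, size=n_states)  # truth, by design
    sample_size = rng.integers(50, 500, size=n_states)  # uneven state samples

    # Direct estimate: the sample proportion within each state.
    direct = rng.binomial(sample_size, true_rate) / sample_size

    # Ad hoc shrinkage: pull small-sample states toward the overall mean.
    w = sample_size / (sample_size + 100.0)
    shrunk = w * direct + (1 - w) * direct.mean()

    def rmse(est):
        return np.sqrt(np.mean((est - true_rate) ** 2))

    print("direct RMSE: %.4f   shrunk RMSE: %.4f" % (rmse(direct), rmse(shrunk)))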

That leads me to another concern, both in Breiman’s paper and in the Machine Learning classes 
I’ve taken here at CMU. The “lack of a generative data model” basically means that you’re 
assuming your training data are taken i.i.d. (or SRS, as a simple random sample) from the 
population you want to learn about. Firstly, that IS a kind of generative data model. I’ll 
treat this as an admission from Breiman that we’ve established you do need a data model; now 
we’re just quibbling over its extent <http://quoteinvestigator.com/2012/03/07/haggling/> 
:)  But secondly, what *do* Machine Learning people do when the data are not i.i.d.? If your 
training and test data aren’t representative of the broader population, a simple prediction 
accuracy rate is meaningless. There must be some research into this, but I’ve hardly seen any. 
For instance, I still know of only one paper (Toth & Eltinge 
<http://www.bls.gov/osmr/pdf/st100010.pdf>, 2011) on how to do regression trees 
when your data come from a complex sample survey.
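
As a tiny example of the issue (my own sketch, not from that paper): if test 
observations carry survey weights, an unweighted accuracy rate estimates 
performance on the sample, not the population; weighting the correct/incorrect 
indicators by the survey weights is one simple correction.

    # Survey-weighted prediction accuracy.
    import numpy as np

    def weighted_accuracy(y_true, y_pred, weights):
        correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
        w = np.asarray(weights, dtype=float)
        return np.sum(w * correct) / np.sum(w)

    # Hypothetical test fold: the model errs exactly on the upweighted cases.
    y_true = np.array([1, 1, 0, 0, 1])
    y_pred = np.array([1, 1, 0, 1, 0])
    weights = np.array([1.0, 1.0, 1.0, 10.0, 10.0])

    print("unweighted:", np.mean(y_true == y_pred))                   # 0.6
    print("weighted:  ", weighted_accuracy(y_true, y_pred, weights))  # ~0.13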

In class I mentioned that our program has no classes covering nontrivial 
experimental design or survey sampling at the PhD level. Surely there would be 
interest in learning how to do this well for statistics and machine learning. 
My classmate Alex asked if I’m volunteering to teach it :) Maybe not a bad idea 
someday?

In our class discussion, people also pointed out that many of the “algorithmic” 
models can be motivated by a statistical model, just like you can treat many 
data modeling methods as pure algorithms. It seems clear that it’s always good 
to know what implicit model is behind your algorithm. Then at least you have 
some hope of checking your model assumptions, even if the check is imperfect. In 
general, I think there is still a need to develop better model diagnostics for 
both data and algorithmic models. I don’t mean more yes-no goodness of fit 
tests, but better ways to decide *how* your model is weak and can be improved. 
Breiman cites Bill Cleveland admitting that residual analysis doesn’t help much 
beyond four or five dimensions, but that just cries out for more research. 
Breiman’s examples remind you of the importance of checking for 
multicollinearity before you make interpretations, but that is true of 
algorithmic modeling too.
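
One example of the kind of directional diagnostic I have in mind (a sketch in 
the spirit of Gelman and Hill’s binned residuals, with invented data): bin the 
fitted probabilities of a binary-outcome model and see where the average 
residual drifts away from zero. That tells you *where* the model is weak, not 
just whether it fails a test.

    # Binned residuals for a binary-outcome model.
    import numpy as np

    def binned_residuals(p_hat, y, n_bins=10):
        """Mean residual (y - p_hat) within bins of the fitted probability."""
        order = np.argsort(p_hat)
        p, r = p_hat[order], (y - p_hat)[order]
        return [(p[b].mean(), r[b].mean())
                for b in np.array_split(np.arange(len(p)), n_bins)]

    rng = np.random.default_rng(3)
    p_hat = rng.uniform(0.05, 0.95, size=500)  # fitted probs from some model
    y = rng.binomial(1, p_hat ** 1.3)          # truth deviates from the fit

    for center, resid in binned_residuals(p_hat, y):
        print("p_hat ~ %.2f   mean residual % .3f" % (center, resid))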

Yes, there are gaps in traditional statistics culture’s approach, some of which 
algorithmic modeling or machine learning can help to fill. There are even 
bigger gaps in our ability to train non-experts to use statistical models and 
procedures appropriately. But I doubt that non-experts will make much better 
use of random forests or neural nets, even if they could conceivably have 
better prediction performance, even where that concept is relevant. In the end, 
Breiman makes many valid points, but he does not convince me to dismiss 
distributional assumptions and traditional statistics as a dead end approach.

--
glen e. p. ropella, 971-255-2847, http://tempusdictum.com

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
to unsubscribe http://redfish.com/mailman/listinfo/friam_redfish.com
