I suspect that on this forum, almost as bad as the F-word or N-word are the
DM-words: Data Mining. I agree, but I wonder about the criteria.
Often in our various research domains we have no choice but to use
retrospective data. A classic example might be validating an investment
approach by examining historical data, which some call backtesting.
What are the criteria? How can we know when our findings are due to chance?
I've argued that if the model is based on an a priori hypothesis, or can
be justified by previously established theories, the charge of data
mining carries much less weight. When the pre-existing theory is less
substantial, one may ask whether the discovered model fits data not used
to build it (data collected after the model was discovered, or data
that precede the data originally used to create the model).
I'd like to hear the views of people on this forum.
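To make the out-of-sample criterion above concrete, here is a minimal
sketch in Python. Everything in it is an assumption made for illustration:
the function names, the flat benchmark, the 15% volatility, and the
40-period layout are hypothetical and do not come from any real data.
It picks the best of 200 strategies that have no true edge, using only a
discovery window, and then scores that winner on the periods before and
after the window.

```python
import random
import statistics


def mean_excess(returns, benchmark):
    """Average per-period return of the strategy minus the benchmark."""
    return statistics.mean(r - b for r, b in zip(returns, benchmark))


def holdout_check(returns, benchmark, start, end):
    """Compare the discovery window [start, end) with all other periods.

    The held-out periods are those before `start` plus those from `end` on.
    """
    inside = mean_excess(returns[start:end], benchmark[start:end])
    outside = mean_excess(returns[:start] + returns[end:],
                          benchmark[:start] + benchmark[end:])
    return inside, outside


def best_of_noise(n_strategies=200, n_periods=40, start=12, end=33, seed=1):
    """Pick the in-sample winner among strategies that have no real edge."""
    rng = random.Random(seed)
    benchmark = [0.0] * n_periods  # hypothetical flat benchmark
    # Every strategy is pure noise around the benchmark (assumed 15% sd).
    strategies = [[rng.gauss(0.0, 0.15) for _ in range(n_periods)]
                  for _ in range(n_strategies)]
    # Selection uses ONLY the discovery window, as a data miner would.
    winner = max(strategies,
                 key=lambda s: holdout_check(s, benchmark, start, end)[0])
    return holdout_check(winner, benchmark, start, end)


if __name__ == "__main__":
    in_sample, held_out = best_of_noise()
    print(f"in-sample excess: {in_sample:.3f}, held-out excess: {held_out:.3f}")
```

The winner's in-sample excess looks good by construction, because it was
chosen for exactly that. Only the held-out figure is an honest estimate,
and with so few held-out periods it is a noisy one.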
The specific situation I'm referring to is an investment model called the
Foolish Four (http://www.fool.com/school/dowinvesting/dowinvesting.htm)
which was found to beat the S&P 500 and the Dow 30 over the period from 1973
through 1993. In the years since, and when backtested further to 1961, it
has not beaten those traditional benchmark indexes by a similar margin, but
it has not performed worse either (both results could be due to lack of
statistical power). The
Foolish Four is based on the hypothesis that the worst-performing
Dow Jones Industrial Average companies are poised to turn
around because they are simply too great to fail over the long term.
Poor performance is judged by the stock's yield: a high-yielding stock has
a relatively high dividend payment compared to its price, which usually
means the price has fallen. The approach therefore has an a priori rationale.
The Foolish Four consists of 4 of the 5 worst-performing Dow companies; the
very worst is excluded because such companies are often in genuine
long-term financial trouble.
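The selection rule as I have described it can be sketched as follows. This
is only an illustration of the description in this message, not the Motley
Fool's published procedure, which may differ in its details. The function
names and the tickers and figures in the usage example are hypothetical.

```python
def dividend_yield(annual_dividend, price):
    """Yield = annual dividend per share divided by share price."""
    return annual_dividend / price


def select_four(stocks):
    """Return four tickers: the 5 highest yielders minus the very highest.

    `stocks` maps ticker -> (annual_dividend, price). High yield is used
    here as the proxy for poor recent performance.
    """
    ranked = sorted(stocks,
                    key=lambda t: dividend_yield(*stocks[t]),
                    reverse=True)
    # ranked[0] is the highest yielder, which is excluded; keep ranks 2-5.
    return ranked[1:5]


if __name__ == "__main__":
    # Made-up tickers and numbers, for illustration only.
    example = {
        "AAA": (5.0, 100.0),
        "BBB": (4.0, 100.0),
        "CCC": (3.0, 100.0),
        "DDD": (2.0, 100.0),
        "EEE": (1.0, 100.0),
        "FFF": (6.0, 100.0),
    }
    print(select_four(example))
```

Writing the rule down this way makes it clear how many choices it contains
(the yield proxy, the cutoff of five, dropping the top one). Each such
choice is a place where fitting to past data could have crept in.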
I am not affiliated with the Motley Fool (where this investment strategy
is touted), nor am I advertising for them. It is just an interesting
practical problem that raises a question I think many statisticians face:
how to tell when someone has merely mined the data and when they might
have sussed out a valid truth.
Paul Bernhardt
University of Utah
Department of Educational Psychology
===========================================================================
This list is open to everyone. Occasionally, less thoughtful
people send inappropriate messages. Please DO NOT COMPLAIN TO
THE POSTMASTER about these messages because the postmaster has no
way of controlling them, and excessive complaints will result in
termination of the list.
For information about this list, including information about the
problem of inappropriate messages and information about how to
unsubscribe, please see the web page at
http://jse.stat.ncsu.edu/
===========================================================================