Data Mining

Paul Bernhardt Wed, 12 Apr 2000 15:12:20 -0700
I suspect that in this forum the DM-words, Data Mining, are almost as bad 
as the F-word or N-word. I agree, but I wonder about the criteria.

Often in our various research domains we have no choice but to use 
retrospective data. A classic example might be validating an investment 
approach by examining historical data, which some call backtesting. 

What are the criteria? How can we know when we have chance findings?

I've argued that if the model is based on an a priori hypothesis, or can 
be justified by previously established theories, the possibility of data 
mining may be ignored. When the pre-existing theory is less substantial, 
one may ask if the discovered model fits data not included in the 
original model (data which occurs after the model was discovered, or data 
which precedes the data originally used to create the model).
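
To illustrate why a fit to data outside the original sample is informative, 
here is a minimal simulation sketch. Everything in it is an assumption made 
for illustration: the number of candidate strategies, the 21 in-sample and 
12 out-of-sample years, the 15% standard deviation, and the function name 
simulate_mined_strategy. Every simulated strategy has a true excess return 
of zero, so any apparent edge is a chance finding.

```python
import random
import statistics


def simulate_mined_strategy(n_strategies=200, n_in=21, n_out=12, sd=0.15, seed=0):
    """Pick the strategy with the best in-sample mean excess return.

    All strategies are pure noise (true mean excess return of zero).
    Returns (in-sample mean, out-of-sample mean) for the selected strategy.
    """
    rng = random.Random(seed)
    best_in, best_out = None, None
    for _ in range(n_strategies):
        # Hypothetical annual excess returns over a benchmark.
        in_sample = [rng.gauss(0.0, sd) for _ in range(n_in)]
        out_sample = [rng.gauss(0.0, sd) for _ in range(n_out)]
        mean_in = statistics.mean(in_sample)
        # Keep whichever strategy looked best on the data used for discovery.
        if best_in is None or mean_in > best_in:
            best_in = mean_in
            best_out = statistics.mean(out_sample)
    return best_in, best_out


if __name__ == "__main__":
    mean_in, mean_out = simulate_mined_strategy()
    print(f"best in-sample mean excess return: {mean_in:.3f}")
    print(f"same strategy out of sample:       {mean_out:.3f}")
```

The selected strategy looks good in sample only because it was the best of 
many tries; on fresh data its mean excess return tends to fall back toward 
zero. A model backed by an a priori hypothesis corresponds to testing one 
strategy rather than choosing the best of 200.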

I'd like to hear the views of people on this forum. 

The specific situation I'm referring to is an investment model called the 
Foolish Four (http://www.fool.com/school/dowinvesting/dowinvesting.htm) 
which was found to beat the S&P500 and Dow 30 over the period from 1973 
through 1993. Since then, and when backtested further to 1961, it has not 
similarly beaten those traditional benchmark indexes, but it also has not 
performed worse (both of which could be due to lack of power). The 
Foolish Four is based on a reasonable hypothesis that the worst 
performing Dow Jones Industrial Average companies are poised to turn 
around because they are simply too great to fail over the long term. The 
judgement of poor performance is based on the stock yield (a high 
yielding stock has a relatively high dividend payment compared to price), 
so a reasonable hypothesis is used to justify this approach. The Foolish 
Four is made up of 4 of the 5 worst performing Dow companies; the worst 
is excluded because it is often in actual long term financial trouble.
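
The selection rule as described in the paragraph above can be sketched as 
follows. This follows only the description given here, not necessarily the 
rule as published by the Motley Fool. The function name and the tickers and 
yields in the usage example are hypothetical placeholders, not real data.

```python
def foolish_four_as_described(dividend_yields):
    """Select four stocks using the rule described in this post.

    dividend_yields maps a ticker to its dividend yield (dividend / price).
    The five highest-yielding stocks are treated as the worst performers,
    and the single highest-yielding one is then dropped.
    """
    if len(dividend_yields) < 5:
        raise ValueError("need at least five stocks to apply the rule")
    # Rank from highest yield (worst performer) to lowest.
    ranked = sorted(dividend_yields, key=dividend_yields.get, reverse=True)
    top_five = ranked[:5]
    # Exclude the very worst, which may be in real long-term trouble.
    return top_five[1:]


if __name__ == "__main__":
    # Hypothetical tickers and yields, for illustration only.
    example = {"AAA": 0.051, "BBB": 0.047, "CCC": 0.044, "DDD": 0.040,
               "EEE": 0.038, "FFF": 0.021, "GGG": 0.015}
    print(foolish_four_as_described(example))
```

Note that the rule itself has tunable parts (how many stocks to rank, how 
many to drop), and each such choice made after looking at the historical 
data is another opportunity for a chance finding.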

I am not affiliated with the Motley Fool (where this investment strategy 
is touted) nor am I advertising for them. It is just an interesting 
practical problem which raises a question I think many statisticians face: 
how to explain when someone has conducted data mining and when they might 
have sussed out a valid truth.

Paul Bernhardt
University of Utah
Department of Educational Psychology

