No matter how the performance of the model is measured (precision, recall, MSE, correlation), we always need to measure it on the test set, not on the training set. Performance on the training set only tells us that the model has learned what it was supposed to learn; it is not a good indicator of performance on unseen data. The test set can be obtained from an independent sample or with resampling/holdout techniques (cross-validation, leave-one-out). To meaningfully compare two algorithms on a given type of data, we also need to test whether the difference in performance is statistically significant, and we need to compare performance against a baseline (chance level, or the frequency of the most common class).
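As a minimal sketch of these points in Python with scikit-learn and SciPy (the data, classifiers, and fold count below are purely illustrative, not a prescription):

# Holdout, cross-validation, a frequency baseline, and a significance check.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for your own sample.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 1) Holdout: fit on one part of the data, measure on data the model never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", tree.score(X_test, y_test))

# 2) Cross-validation: every case serves as test data exactly once.
cv_tree = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
cv_nb = cross_val_score(GaussianNB(), X, y, cv=10)

# 3) Baseline: always predict the most frequent class.
cv_base = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=10)
print("tree: %.3f  nb: %.3f  baseline: %.3f"
      % (cv_tree.mean(), cv_nb.mean(), cv_base.mean()))

# 4) Is the difference between the two learners significant?
#    A paired t-test over the fold scores is a simple (if approximate) check.
t, p = stats.ttest_rel(cv_tree, cv_nb)
print("paired t-test: t=%.2f, p=%.3f" % (t, p))

The paired test over fold scores is only approximate (the folds are not fully independent), but it is a reasonable first check before reaching for more careful procedures discussed in the references below.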
References:

http://www.mccombs.utexas.edu/faculty/Maytal.Saar-Tsechansky/Teaching/MIS_373/Fall2004/Model%20Evaluation.ppt
http://research.cs.tamu.edu/prism/lectures/iss/iss_l13.pdf
http://homepages.inf.ed.ac.uk/keller/teaching/internet/lecture_evaluation.pdf

Mitchell, Tom M. 1997. Machine Learning. New York: McGraw-Hill.
Witten, Ian H., and Eibe Frank. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Diego, CA: Morgan Kaufmann.

----- Original Message -----
From: "Henry Bulley" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, November 27, 2004 12:28 PM
Subject: A Classification validation question

> Hello,
>
> I recently read that:
> "you can't validate the classification model with the data used to develop
> the model. You must use completely independent data, otherwise you bias the
> results."
>
> Is there any resampling approach to address this issue?
> I would be grateful if any of you can point me to some good references or
> studies.
>
> Thanks for your help
>
> Henry
>
