This classic paper on statistical practice (Breiman's "two cultures")
may be helpful for understanding the different viewpoints:
https://projecteuclid.org/euclid.ss/1009213726
On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote:
As far as I understand: holding out a test set is recommended if
you aren't entirely sure that the assumptions of the model hold
(Gaussian error on a linear fit; independent and identically
distributed samples). The model evaluation approach in predictive
ML, using held-out data, relies only on the weaker assumption that
the metric you have chosen, when applied to the test set you have
held out, forms a reasonable measure of generalised / real-world
performance. (Of course this assumption too often fails to hold in
practice, but it is, in my opinion, the primary assumption that ML
practitioners need to be careful of.)
Dear CW,
As Joel has said, holding out a test set will help you evaluate the
validity of model assumptions, and his last point (reasonable measure
of generalised performance) is absolutely essential for understanding
the capabilities and limitations of ML.
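For concreteness, here is a minimal sketch of that held-out
evaluation; the dataset, estimator, and metric below are arbitrary
choices purely for illustration:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)

    # Keep a test set aside; it is never touched during fitting.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = LinearRegression().fit(X_train, y_train)

    # The remaining assumption is that this metric, on this held-out
    # set, is a reasonable proxy for real-world performance.
    print(mean_absolute_error(y_test, model.predict(X_test)))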
To add to your checklist for interpreting ML papers properly: be
cautious about reports of high performance obtained with 5/10-fold
or leave-one-out cross-validation on large datasets, where "large"
depends on the nature of the problem setting.
Results are also highly dependent on the distributions of the
underlying independent variables (e.g., 60000 datapoints all with
near-identical distributions may yield phenomenal cross-validation
performance yet produce a model that is almost non-predictive in
truly unknown/prospective situations).
Even at 500 datapoints, if independent variable distributions look
similar (with similar endpoints), then when each model is trained on
80% of that data, the remaining 20% will certainly be predictable, and
repeating that five times will yield statistics that seem impressive.
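As a rough sketch of that effect (the data here are synthetic and
purely hypothetical): build a dataset of near-duplicate points, and
k-fold CV will tend to look far better than performance on genuinely
new data, because near-copies of every test point sit in the
training folds.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    # 500 points that are tiny perturbations of only 50 distinct samples.
    base = rng.normal(size=(50, 10))
    X = np.repeat(base, 10, axis=0) + rng.normal(scale=0.01, size=(500, 10))
    y = X[:, 0] + rng.normal(scale=0.1, size=500)

    model = RandomForestRegressor(random_state=0)

    # 5-fold CV: every test fold has near-copies in the training folds,
    # so these scores tend to be optimistic.
    print(cross_val_score(model, X, y, cv=5).mean())

    # Data drawn away from the training distribution (a stand-in for a
    # prospective setting) typically reveals the gap.
    X_new = rng.normal(loc=3.0, size=(100, 10))
    y_new = X_new[:, 0] + rng.normal(scale=0.1, size=100)
    print(model.fit(X, y).score(X_new, y_new))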
So, again, while problem context completely dictates ML experiment
design, metric selection, and interpretation of outcome, my personal
rule of thumb is to do no more than 2-fold cross-validation (50%
train, 50% predict) when I have 100+ datapoints.
Even more extreme: try using 33% for training and 67% for validation
(or even a 20/80 split).
If your model still reports good statistics, then you can believe
that the patterns in the training data extrapolate well to those in
the external validation data.
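In scikit-learn terms, one way to set up those stricter splits (an
illustrative sketch; the estimator and dataset are arbitrary) is:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import ShuffleSplit, cross_val_score

    X, y = load_diabetes(return_X_y=True)

    # 2-fold CV: train on 50%, predict the other 50%, then swap.
    print(cross_val_score(Ridge(), X, y, cv=2))

    # More extreme: train on 33%, validate on the remaining 67%,
    # repeated over a few random splits.
    splitter = ShuffleSplit(n_splits=5, train_size=0.33,
                            test_size=0.67, random_state=0)
    print(cross_val_score(Ridge(), X, y, cv=splitter))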
Hope this helps,
J.B.
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn