This classic paper on statistical practice (Breiman's "two cultures") might be helpful for understanding the different viewpoints:

https://projecteuclid.org/euclid.ss/1009213726


On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote:

    As far as I understand: Holding out a test set is recommended if
    you aren't entirely sure that the assumptions of the model hold
    (Gaussian error on a linear fit; independent and identically
    distributed samples). The model evaluation approach in predictive
    ML, using held-out data, relies only on the weaker assumption that
    the metric you have chosen, when applied to the test set you have
    held out, forms a reasonable measure of generalised / real-world
    performance. (Of course this assumption often fails to hold in
    practice as well, but it is the primary assumption, in my opinion,
    that ML practitioners need to be careful of.)


Dear CW,
As Joel has said, holding out a test set will help you evaluate the validity of model assumptions, and his last point (a reasonable measure of generalised performance) is absolutely essential for understanding the capabilities and limitations of ML.
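
For concreteness, a minimal sketch of such a held-out evaluation in scikit-learn might look like the following; the synthetic data, linear model, and MAE metric are placeholders, and the right choices depend entirely on your problem:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Placeholder data and model; substitute your own.
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0,
                           random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print("Held-out MAE:", mean_absolute_error(y_test, model.predict(X_test)))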

To add to your checklist for interpreting ML papers properly, be cautious with reports of high performance obtained by 5/10-fold or leave-one-out cross-validation on large datasets, where "large" depends on the nature of the problem setting. Results are also highly dependent on the distributions of the underlying independent variables (e.g., 60000 datapoints all with near-identical distributions may yield phenomenal performance in cross-validation and yet be almost non-predictive in truly unknown/prospective situations). Even at 500 datapoints, if the independent variable distributions look similar (with similar endpoints), then when each model is trained on 80% of that data, the remaining 20% will almost certainly be predictable, and repeating that five times will yield statistics that seem impressive.
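
As a hypothetical illustration of that optimism (synthetic data, an arbitrary random-forest model, and arbitrary seeds, purely for the sake of the sketch): when rows are near-duplicates of one another, k-fold cross-validation scores look excellent, while performance on data from a genuinely new region collapses:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold, cross_val_score

    rng = np.random.RandomState(0)

    # 100 "parent" points, each replicated 5 times with tiny perturbations,
    # so every CV fold shares near-identical samples with its training folds.
    parents = rng.uniform(-1, 1, size=(100, 5))
    X = np.repeat(parents, 5, axis=0) + rng.normal(scale=0.01, size=(500, 5))
    y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + rng.normal(scale=0.05, size=500)

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    print("5-fold CV R^2:", cross_val_score(model, X, y, cv=cv).mean())

    # "Prospective" data drawn from a region the training set never covered.
    X_new = rng.uniform(1, 2, size=(200, 5))
    y_new = X_new[:, 0] ** 2 + np.sin(3 * X_new[:, 1])
    print("R^2 on the new region:", model.fit(X, y).score(X_new, y_new))

The first number will typically be close to 1, while the second is far lower, even though nothing about the model changed.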

So, again, while problem context completely dictates ML experiment design, metric selection, and interpretation of outcomes, my personal rule of thumb is to do no more than 2-fold cross-validation (50% train, 50% predict) when I have 100+ datapoints. Even more extreme, try using 33% for training and 67% for validation (or even 20/80). If your model still reports good statistics, then you can believe that the patterns in the training data extrapolate well to those in the external validation data.
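
In scikit-learn terms, a minimal sketch of those more conservative splits might look like this (the ridge estimator and synthetic data are placeholders):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold, cross_val_score, train_test_split

    X, y = make_regression(n_samples=200, n_features=20, noise=15.0,
                           random_state=0)

    # 2-fold CV: each half is predicted by a model trained on the other half.
    cv = KFold(n_splits=2, shuffle=True, random_state=0)
    print("2-fold CV R^2:", cross_val_score(Ridge(), X, y, cv=cv).mean())

    # Train on 33% and validate on the remaining 67%.
    X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.33,
                                                      random_state=0)
    print("33/67 validation R^2:",
          Ridge().fit(X_train, y_train).score(X_val, y_val))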

Hope this helps,
J.B.




_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
