This classic paper on statistical practice (Breiman's "two cultures") might be helpful for understanding the different viewpoints:

https://projecteuclid.org/euclid.ss/1009213726


On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote:

    As far as I understand: Holding out a test set is recommended if
    you aren't entirely sure that the assumptions of the model hold
    (Gaussian error on a linear fit; independent and identically
    distributed samples). The model evaluation approach in predictive
    ML, using held-out data, relies only on the weaker assumption that
    the metric you have chosen, when applied to the test set you have
    held out, forms a reasonable measure of generalised / real-world
    performance. (Of course this assumption often fails to hold in
    practice as well, but it is the primary assumption, in my opinion,
    that ML practitioners need to be careful of.)


Dear CW,
As Joel has said, holding out a test set will help you evaluate the validity of model assumptions, and his last point (a reasonable measure of generalised performance) is absolutely essential for understanding the capabilities and limitations of ML.
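
For concreteness, a minimal sketch of such a held-out evaluation in scikit-learn might look like the following; the synthetic data, linear model, and MAE metric are placeholders, and the right choices depend entirely on your problem:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Placeholder data and model; substitute your own.
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0,
                           random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print("Held-out MAE:", mean_absolute_error(y_test, model.predict(X_test)))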

To add to your checklist for interpreting ML papers properly, be cautious with reports of high performance obtained by 5/10-fold or leave-one-out cross-validation on large datasets, where "large" depends on the nature of the problem setting. Results are also highly dependent on the distributions of the underlying independent variables (e.g., 60000 datapoints all with near-identical distributions may yield phenomenal performance in cross-validation and yet be almost non-predictive in truly unknown/prospective situations). Even at 500 datapoints, if the independent variable distributions look similar (with similar endpoints), then when each model is trained on 80% of that data, the remaining 20% will almost certainly be predictable, and repeating that five times will yield statistics that seem impressive.
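
As a hypothetical illustration of that optimism (synthetic data, an arbitrary random-forest model, and arbitrary seeds, purely for the sake of the sketch): when rows are near-duplicates of one another, k-fold cross-validation scores look excellent, while performance on data from a genuinely new region collapses:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold, cross_val_score

    rng = np.random.RandomState(0)

    # 100 "parent" points, each replicated 5 times with tiny perturbations,
    # so every CV fold shares near-identical samples with its training folds.
    parents = rng.uniform(-1, 1, size=(100, 5))
    X = np.repeat(parents, 5, axis=0) + rng.normal(scale=0.01, size=(500, 5))
    y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + rng.normal(scale=0.05, size=500)

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    print("5-fold CV R^2:", cross_val_score(model, X, y, cv=cv).mean())

    # "Prospective" data drawn from a region the training set never covered.
    X_new = rng.uniform(1, 2, size=(200, 5))
    y_new = X_new[:, 0] ** 2 + np.sin(3 * X_new[:, 1])
    print("R^2 on the new region:", model.fit(X, y).score(X_new, y_new))

The first number will typically be close to 1, while the second is far lower, even though nothing about the model changed.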

So, again, while problem context completely dictates ML experiment design, metric selection, and interpretation of outcomes, my personal rule of thumb is to do no more than 2-fold cross-validation (50% train, 50% predict) when I have 100+ datapoints. Even more extreme, try using 33% for training and 67% for validation (or even 20/80). If your model still reports good statistics, then you can believe that the patterns in the training data extrapolate well to those in the external validation data.
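
In scikit-learn terms, a minimal sketch of those more conservative splits might look like this (the ridge estimator and synthetic data are placeholders):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold, cross_val_score, train_test_split

    X, y = make_regression(n_samples=200, n_features=20, noise=15.0,
                           random_state=0)

    # 2-fold CV: each half is predicted by a model trained on the other half.
    cv = KFold(n_splits=2, shuffle=True, random_state=0)
    print("2-fold CV R^2:", cross_val_score(Ridge(), X, y, cv=cv).mean())

    # Train on 33% and validate on the remaining 67%.
    X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.33,
                                                      random_state=0)
    print("33/67 validation R^2:",
          Ridge().fit(X_train, y_train).score(X_val, y_val))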

Hope this helps,
J.B.




_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
