Hi all,

As everyone knows, sklearn is excellent for building predictive models, but one area where I believe there is still work to be done is in measuring the uncertainty inherent in those models. (That there is an appetite for this is, I believe, evidenced by the rise in popularity of probabilistic programming.) We can, for example, easily find point estimates for the coefficients of linear models in sklearn, but making inferences from those point estimates is not possible without some measure of their probable error.
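To make the gap concrete, here is a minimal sketch of the kind of calculation I mean: a plain case-resampling (percentile) bootstrap for the coefficients of a linear model. It uses only sklearn.utils.resample and numpy, not the API of the resample package itself. The helper name bootstrap_coef_intervals, the synthetic data, and the defaults (500 replicates, 95% level) are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.utils import resample


def bootstrap_coef_intervals(X, y, n_boot=500, alpha=0.05, seed=0):
    """Hypothetical helper: percentile bootstrap for LinearRegression coefficients.

    Rows of (X, y) are resampled together with replacement, the model is
    refit on each replicate, and the spread of the refitted coefficients
    is used as a measure of uncertainty.
    """
    rng = np.random.RandomState(seed)
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        # Draw a bootstrap sample of the same size as the original data.
        Xb, yb = resample(X, y, replace=True, random_state=rng)
        coefs[b] = LinearRegression().fit(Xb, yb).coef_
    # Bootstrap standard error of each coefficient.
    se = coefs.std(axis=0, ddof=1)
    # Percentile interval at level 1 - alpha for each coefficient.
    lower, upper = np.percentile(
        coefs, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0
    )
    return se, lower, upper


# Synthetic data (an assumption for this example): three features,
# one of which has no effect on the response.
data_rng = np.random.RandomState(42)
X = data_rng.normal(size=(200, 3))
true_coef = np.array([2.0, 0.0, -1.0])
y = X @ true_coef + data_rng.normal(scale=1.0, size=200)

point = LinearRegression().fit(X, y).coef_
se, lower, upper = bootstrap_coef_intervals(X, y)
for j in range(X.shape[1]):
    print(
        f"coef[{j}] = {point[j]: .3f}  "
        f"se = {se[j]:.3f}  "
        f"95% CI = ({lower[j]: .3f}, {upper[j]: .3f})"
    )
```

The percentile interval is only the simplest choice; it is shown here because it needs nothing beyond what sklearn and numpy already provide.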
To address this and other problems, I authored a package called resample, which implements the bootstrap and other randomization-based procedures, with the goal of performing largely nonparametric statistical inference on a wide class of problems. The package is built entirely on numpy and scipy, and so already integrates fairly well with sklearn. There is a tutorial here which, among other things, shows applications using the Boston housing data: https://github.com/dsaxton/resample/blob/master/doc/resample.ipynb

Might there be interest in including something like this as an sklearn-contrib package? Essentially, we are taking what is already in sklearn.utils.resample and extending it to include other forms of the bootstrap (e.g., balanced, parametric, stratified and/or smoothed), algorithms for computing automatic confidence intervals, and procedures for nonparametric, randomization-based hypothesis testing.

Here is the GitHub page: https://github.com/dsaxton/resample

Of course, I would also greatly appreciate any input that others might have on ways this package could be made more useful.

Thanks,
Daniel
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn