Hi,

So why is using a bootstrap sample of size n better than just a random subset of size 0.632*n in a Random Forest?

Thanks
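(For reference, the 0.632 figure that comes up repeatedly below can be checked numerically; a minimal sketch, assuming NumPy is available:)

import numpy as np

rng = np.random.default_rng(0)
n = 100000  # dataset size; large, so the fraction is close to its limit

# Draw an n-sized bootstrap sample (with replacement) and count how many
# distinct original points it contains.
sample = rng.integers(0, n, size=n)
unique_fraction = np.unique(sample).size / n

print(unique_fraction)         # roughly 0.632
print(1.0 - (1.0 - 1.0/n)**n)  # analytic value, approaches 1 - 1/e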
On Tue, Oct 4, 2016 at 1:58 AM, Sebastian Raschka <se.rasc...@gmail.com> wrote:

> Originally, this technique was used to estimate a sampling distribution. Think of the drawing with replacement as a work-around for generating *new* data from a population that is simulated by repeatedly sampling from the given dataset with replacement.
>
> For more details, I'd recommend reading the original literature, e.g.,
>
> Efron, Bradley. 1979. "Bootstrap Methods: Another Look at the Jackknife." The Annals of Statistics 7 (1): 1-26.
>
> There's also a whole book on this topic:
>
> Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the Bootstrap. Chapman & Hall.
>
> Or, more relevant to this particular application, see
>
> Breiman, Leo. 1996. "Bagging Predictors." Machine Learning 24 (2): 123-140.
>
> "Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy."
>
> On Oct 3, 2016, at 4:03 PM, Ibrahim Dalal via scikit-learn <scikit-learn@python.org> wrote:
>
> > So what is the point of having duplicate entries in your training set? This seems like pure overhead. Sorry, but you will again have to help me here.
> >
> > On Tue, Oct 4, 2016 at 1:29 AM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> >
> > > > On Oct 3, 2016, at 3:36 PM, Ibrahim Dalal via scikit-learn <scikit-learn@python.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > That helped a lot. Thank you very much. I have one more (silly?) doubt though.
> > > >
> > > > Won't an n-sized bootstrapped sample have repeated entries? Say we have an original dataset of size 100. A bootstrap sample (say, B) of size 100 is drawn from this set. Since roughly 37 of the original samples are left out (in expectation, at least), some of the samples in B must be repeated?
> > >
> > > Yeah, you'll definitely have duplications; that's why (if you have an infinitely large n) only 0.632*n samples are unique ;).
> > >
> > > Say your dataset is
> > >
> > > [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (where the numbers represent the indices of your data points);
> > >
> > > then a bootstrap sample could be
> > >
> > > [9, 1, 1, 0, 4, 4, 5, 7, 9, 9], and your left-out sample is consequently [2, 3, 6, 8].
> > >
> > > > On Tue, Oct 4, 2016 at 12:50 AM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> > > >
> > > > > Or maybe more intuitively, you can visualize this asymptotic behavior, e.g., via
> > > > >
> > > > > import matplotlib.pyplot as plt
> > > > >
> > > > > ns = list(range(5, 201, 5))
> > > > > vs = [1. - (1. - 1./n)**n for n in ns]  # P(chosen) for each n
> > > > >
> > > > > plt.plot(ns, vs, marker='o', markersize=6, alpha=0.5)
> > > > > plt.xlabel('n')
> > > > > plt.ylabel('1 - (1 - 1/n)^n')
> > > > > plt.xlim([0, 210])
> > > > > plt.show()
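(As an aside, the toy example with indices [0, ..., 9] above is easy to reproduce; a minimal sketch, assuming NumPy, with an arbitrary seed, so the exact draw will differ:)

import numpy as np

rng = np.random.default_rng(42)
indices = np.arange(10)  # a 10-point dataset, as in the example above

# A bootstrap sample: n draws with replacement from the n indices.
bootstrap = rng.choice(indices, size=indices.size, replace=True)

# The left-out (out-of-bag) points are the indices that were never drawn.
oob = np.setdiff1d(indices, bootstrap)

print("bootstrap sample:", bootstrap)
print("out-of-bag:", oob)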
> > > > > On Oct 3, 2016, at 3:15 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> > > > >
> > > > > > Say the probability that a given sample from a dataset of size n is *not* drawn as a bootstrap sample is
> > > > > >
> > > > > > P(not_chosen) = (1 - 1/n)^n,
> > > > > >
> > > > > > since each draw picks a particular sample with probability 1/n (bootstrapping involves drawing with replacement), and the draw is repeated n times to get an n-sized bootstrap sample.
> > > > > >
> > > > > > This is asymptotically 1/e approx. 0.368 (i.e., for very, very large n).
> > > > > >
> > > > > > Then, you can compute the probability of a sample being chosen as
> > > > > >
> > > > > > P(chosen) = 1 - (1 - 1/n)^n approx. 0.632
> > > > > >
> > > > > > Best,
> > > > > > Sebastian
> > > > > >
> > > > > > On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn <scikit-learn@python.org> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Thank you for the reply. Please bear with me for a while.
> > > > > > >
> > > > > > > Where did this number, 0.632, come from? I have no background in statistics (which appears to be the case here!). Or let me rephrase my query: what is this bootstrap sampling all about? I searched the web but didn't get satisfactory results.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> > > > > > >
> > > > > > > > > From whatever little knowledge I gained last night about Random Forests, each tree is trained with a sub-sample of the original dataset (usually drawn with replacement)?
> > > > > > > >
> > > > > > > > Yes, that should be correct!
> > > > > > > >
> > > > > > > > > Now, what I am not able to understand is: if the entire dataset is used to train each of the trees, then how does the classifier estimate the OOB error? None of the entries of the dataset is out-of-bag for any of the trees. (Pardon me if all this sounds like BS.)
> > > > > > > >
> > > > > > > > If you take an n-sized bootstrap sample, where n is the number of samples in your dataset, you asymptotically have 0.632 * n unique samples in your bootstrap set. In other words, 0.368 * n samples are not used for growing the respective tree (and are available to compute the OOB error). As far as I understand, the random forest OOB score is then computed as the average OOB of each tree (correct me if I am wrong!).
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Sebastian
> > > > > > > >
> > > > > > > > > On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn <scikit-learn@python.org> wrote:
> > > > > > > > >
> > > > > > > > > Dear Developers,
> > > > > > > > >
> > > > > > > > > From whatever little knowledge I gained last night about Random Forests, each tree is trained with a sub-sample of the original dataset (usually drawn with replacement)?
> > > > > > > > >
> > > > > > > > > (Note: please do correct me if I am not making any sense.)
> > > > > > > > >
> > > > > > > > > RandomForestClassifier has a 'bootstrap' option. The API states the following:
> > > > > > > > >
> > > > > > > > > "The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default)."
> > > > > > > > >
> > > > > > > > > Now, what I am not able to understand is: if the entire dataset is used to train each of the trees, then how does the classifier estimate the OOB error? None of the entries of the dataset is out-of-bag for any of the trees. (Pardon me if all this sounds like BS.)
> > > > > > > > >
> > > > > > > > > Help this mere mortal.
> > > > > > > > > Thanks
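(For completeness, the OOB estimate discussed in this thread is exposed directly by scikit-learn's RandomForestClassifier; a minimal usage sketch on synthetic data, with arbitrary parameter values:)

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, just to have something to fit.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# bootstrap=True (the default) grows each tree on an n-sized bootstrap
# sample; oob_score=True scores each training sample using only the trees
# that did not see it during fitting.
clf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                             oob_score=True, random_state=0)
clf.fit(X, y)

print(clf.oob_score_)  # OOB estimate of the generalization accuracy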
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn