Re: [scikit-learn] Random Forest with Bootstrapping

Sebastian Raschka Mon, 03 Oct 2016 12:22:44 -0700

Or, maybe more intuitively, you can visualize this asymptotic behavior, e.g., via


import matplotlib.pyplot as plt

# P(chosen) = 1 - (1 - 1/n)^n for n = 5, 10, ..., 200
ns = list(range(5, 201, 5))
vs = [1 - (1. - 1./n)**n for n in ns]

plt.plot(ns, vs, marker='o', markersize=6, alpha=0.5)

plt.xlabel('n')
plt.ylabel('1 - (1 - 1/n)^n')
plt.xlim([0, 210])
plt.show()
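
The same number can also be checked empirically, without plotting. The sketch below is only an illustration and not part of the original exchange: it uses nothing but Python's standard library, and the helper name `unique_fraction` is made up for this example. It draws n indices with replacement (one bootstrap sample), counts how many distinct indices were drawn, and compares the averaged fraction with 1 - (1 - 1/n)^n.

```python
import random


def unique_fraction(n, rng):
    """Draw one n-sized bootstrap sample from range(n) and return
    the fraction of distinct items that ended up in it."""
    drawn = {rng.randrange(n) for _ in range(n)}  # sampling with replacement
    return len(drawn) / n


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
trials = 50             # bootstrap samples averaged per value of n

for n in (10, 100, 1000, 10000):
    simulated = sum(unique_fraction(n, rng) for _ in range(trials)) / trials
    exact = 1 - (1 - 1 / n) ** n
    print(f"n={n:>5}  simulated={simulated:.4f}  exact={exact:.4f}")
```

The items that never land in `drawn` are the out-of-bag samples for that bootstrap draw, so one minus the returned fraction should be close to 0.368 for large n.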

> On Oct 3, 2016, at 3:15 PM, Sebastian Raschka <[email protected]> wrote:
> 
> The probability that a given sample from a dataset of size n is *not* 
> drawn into an n-sized bootstrap sample is
> 
> P(not_chosen) = (1 - 1/n)^n
> 
> This holds because each draw picks a particular sample with probability 1/n 
> (bootstrapping draws with replacement, so this is the same on every draw) 
> and therefore misses it with probability 1 - 1/n; you repeat the draw n 
> times to get an n-sized bootstrap sample.
> 
> This is asymptotically 1/e, approx. 0.368 (i.e., for very, very large n).
> 
> Then, you can compute the probability of a sample being chosen as
> 
> P(chosen) = 1 - (1 - 1/n)^n approx. 0.632 
> 
> Best,
> Sebastian
> 
>> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn 
>> <[email protected]> wrote:
>> 
>> Hi,
>> 
>> Thank you for the reply. Please bear with me for a while.
>> 
>> From where did this number, 0.632, come? I have no background in statistics 
>> (which appears to be the case here!). Or let me rephrase my query: what is 
>> this bootstrap sampling all about? Searched the web, but didn't get 
>> satisfactory results.
>> 
>> 
>> Thanks
>> 
>> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka <[email protected]> 
>> wrote:
>>> From whatever little knowledge I gained last night about Random Forests, 
>>> each tree is trained with a sub-sample of original dataset (usually with 
>>> replacement)?.
>> 
>> Yes, that should be correct!
>> 
>>> Now, what I am not able to understand is - if the entire dataset is used 
>>> to train each of the trees, then how does the classifier estimate the OOB 
>>> error? None of the entries of the dataset is an oob for any of the trees. 
>>> (Pardon me if all this sounds BS)
>> 
>> If you take an n-sized bootstrap sample, where n is the number of samples 
>> in your dataset, you have asymptotically 0.632 * n unique samples in your 
>> bootstrap set. Or, in other words, 0.368 * n samples are not used for 
>> growing the respective tree (and can be used to compute the OOB). As far as 
>> I understand, the random forest OOB score is then computed as the average 
>> OOB of each tree (correct me if I am wrong!).
>> 
>> Best,
>> Sebastian
>> 
>>> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn 
>>> <[email protected]> wrote:
>>> 
>>> Dear Developers,
>>> 
>>> From whatever little knowledge I gained last night about Random Forests, 
>>> each tree is trained with a sub-sample of original dataset (usually with 
>>> replacement)?.
>>> 
>>> (Note: Please do correct me if I am not making any sense.)
>>> 
>>> RandomForestClassifier has an option of 'bootstrap'. The API states the 
>>> following
>>> 
>>> The sub-sample size is always the same as the original input sample size 
>>> but the samples are drawn with replacement if bootstrap=True (default).
>>> 
>>> Now, what I am not able to understand is - if the entire dataset is used 
>>> to train each of the trees, then how does the classifier estimate the OOB 
>>> error? None of the entries of the dataset is an oob for any of the trees. 
>>> (Pardon me if all this sounds BS)
>>> 
>>> Help this mere mortal.
>>> 
>>> Thanks
>>> _______________________________________________
>>> scikit-learn mailing list
>>> [email protected]
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>> 
