Or, perhaps more intuitively, you can visualize this asymptotic behavior, e.g., via

    import matplotlib.pyplot as plt

    ns = list(range(5, 201, 5))  # dataset sizes n = 5, 10, ..., 200
    vs = []
    for n in ns:
        # probability that a given sample appears in the bootstrap sample
        v = 1 - (1. - 1./n)**n
        vs.append(v)

    plt.plot(ns, vs, marker='o', markersize=6, alpha=0.5)
    plt.xlabel('n')
    plt.ylabel('1 - (1 - 1/n)^n')
    plt.xlim([0, 210])
    plt.show()

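
The same number can also be checked empirically. Below is a minimal sketch using only the Python standard library; the helper name unique_fraction, the seed, and the choices n = 1000 and 200 trials are arbitrary and only for illustration. It draws n-sized bootstrap samples with replacement and measures the fraction of distinct samples that end up in each one, which should come out close to 1 - (1 - 1/n)^n.

```python
import random


def unique_fraction(n, rng):
    """Draw an n-sized bootstrap sample of the indices 0..n-1 (with
    replacement) and return the fraction of distinct indices drawn."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n


rng = random.Random(0)   # fixed seed, arbitrary, for reproducibility
n, n_trials = 1000, 200  # illustrative values, not from the thread

# Average the unique fraction over many independent bootstrap samples
mean_fraction = sum(unique_fraction(n, rng) for _ in range(n_trials)) / n_trials

# Closed-form value from the formula discussed in this thread
expected = 1 - (1 - 1 / n) ** n

print(f"simulated: {mean_fraction:.3f}, formula: {expected:.3f}")
```

The remaining fraction (roughly 0.368) corresponds to the samples that a given bootstrap sample leaves out.
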
> On Oct 3, 2016, at 3:15 PM, Sebastian Raschka <[email protected]> wrote:
>
> The probability that a given sample from a dataset of size n is *not*
> drawn into an n-sized bootstrap sample is
>
> P(not_chosen) = (1 - 1/n)^n
>
> This is because each draw has a 1/n chance of picking that particular sample
> (bootstrapping means drawing with replacement, so the chance is the same on
> every draw), hence a (1 - 1/n) chance of missing it, and you repeat the draw
> n times to get an n-sized bootstrap sample.
>
> Asymptotically (i.e., for very, very large n), this is 1/e, approx. 0.368.
>
> Then, you can compute the probability of a sample being chosen as
>
> P(chosen) = 1 - (1 - 1/n)^n approx. 0.632
>
> Best,
> Sebastian
>
>> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn
>> <[email protected]> wrote:
>>
>> Hi,
>>
>> Thank you for the reply. Please bear with me for a while.
>>
>> From where did this number, 0.632, come? I have no background in statistics
>> (which appears to be the case here!). Or let me rephrase my query: what is
>> this bootstrap sampling all about? Searched the web, but didn't get
>> satisfactory results.
>>
>>
>> Thanks
>>
>> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka <[email protected]>
>> wrote:
>>> From whatever little knowledge I gained last night about Random Forests,
>>> each tree is trained with a sub-sample of original dataset (usually with
>>> replacement)?.
>>
>> Yes, that should be correct!
>>
>>> Now, what I am not able to understand is: if the entire dataset is used to
>>> train each of the trees, then how does the classifier estimate the OOB
>>> error? None of the entries of the dataset is OOB for any of the trees.
>>> (Pardon me if all this sounds BS)
>>
>> If you take an n-sized bootstrap sample, where n is the number of samples in
>> your dataset, you have asymptotically 0.632 * n unique samples in your
>> bootstrap set. In other words, 0.368 * n samples are not used for growing
>> the respective tree, and those are the ones used to compute its OOB. As far
>> as I understand, the random forest OOB score is then computed as the average
>> OOB of each tree (correct me if I am wrong!).
>>
>> Best,
>> Sebastian
>>
>>> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn
>>> <[email protected]> wrote:
>>>
>>> Dear Developers,
>>>
>>> From whatever little knowledge I gained last night about Random Forests,
>>> each tree is trained with a sub-sample of original dataset (usually with
>>> replacement)?.
>>>
>>> (Note: Please do correct me if I am not making any sense.)
>>>
>>> RandomForestClassifier has an option of 'bootstrap'. The API states the
>>> following
>>>
>>> The sub-sample size is always the same as the original input sample size
>>> but the samples are drawn with replacement if bootstrap=True (default).
>>>
>>> Now, what I am not able to understand is: if the entire dataset is used to
>>> train each of the trees, then how does the classifier estimate the OOB
>>> error? None of the entries of the dataset is OOB for any of the trees.
>>> (Pardon me if all this sounds BS)
>>>
>>> Help this mere mortal.
>>>
>>> Thanks
>>> _______________________________________________
>>> scikit-learn mailing list
>>> [email protected]
>>> https://mail.python.org/mailman/listinfo/scikit-learn