Hi Peter. Re. looping over features for RandomForests in
_tree.pyx:Tree.find_best_split - yes, I see it now, thanks.

Re. seeing where max_depth is used - cool, I see that now too in
base.py._make_estimator(), thanks.
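
For my notes, here's my mental model of what happens there, as a rough
sketch (the helper name and params dict below are made up, not the
real sklearn code):

    from sklearn.base import clone
    from sklearn.tree import DecisionTreeClassifier

    def make_estimator_sketch(base_estimator, params):
        # clone() returns an unfitted copy with the same constructor args
        estimator = clone(base_estimator)
        # the forest then pushes its own constructor arguments
        # (max_depth, max_features, random_state, ...) onto the clone
        estimator.set_params(**params)
        return estimator

    tree = make_estimator_sketch(DecisionTreeClassifier(), {"max_depth": 3})
    print(tree.max_depth)  # -> 3

So max_depth isn't ignored after all - it's forwarded to the cloned
tree via set_params.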

Re. my question:
>>    I'm interested to learn the lower bound of the number of random
>>    features that can be chosen.
>could you elaborate on that?

I was wondering whether, given just 2 features as we have in the iris demo:
http://scikit-learn.org/dev/auto_examples/ensemble/plot_forest_iris.html
we'd visit a subset of potentially just 1 of the features, or always
2, when building the RF DecisionTrees. The descriptions I'd read in
several books talked about selecting a random subset of the features
but not what the minimum number of features might be.

As I understand it in sklearn:
In _tree.pyx:Tree.find_best_split we break out of the feature testing loop:
            if visited_features >= max_features:
                break
when we've visited enough features. visited_features is 0 at the start
of the for loop. With the iris dataset in the demo we're limiting the
classifiers to 2 features per row in the plot. For RandomForest
max_features is set in __init__ to be "auto":
          - If "auto", then `max_features=sqrt(n_features)`.
so here max_features = sqrt(2) ~= 1.41; since visited_features is an
integer count, the loop can't break until 2 features have been
visited, which means in practice we check all (both) features at each
step of the tree-building process.
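
In Python pseudocode, my reading of that loop (a sketch with
simplified names, not the real Cython):

    import numpy as np

    def count_visited_features(n_features, max_features, rng):
        # features are considered in a random order; stop once
        # max_features of them have been evaluated
        visited_features = 0
        for feature in rng.permutation(n_features):
            # ... evaluate candidate split points on this feature ...
            visited_features += 1
            if visited_features >= max_features:
                break
        return visited_features

    rng = np.random.RandomState(0)
    # 1 >= 1.41 is False, so a second feature gets visited before the break
    print(count_visited_features(2, np.sqrt(2), rng))  # -> 2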

As such the RandomForest process tests all (not a random subset) of
the features in plot_forest_iris for the RandomForest and ExtraTrees
example columns. That's cool, I just wanted it clear in my mind. If we
had >2 features (e.g. 3, where sqrt(3) ~= 1.73, so only 2 of the 3
would be visited) then we'd start to sample a random subset of the
features, as the quick check below suggests.
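
A quick sanity check of how that threshold scales, assuming
max_features really is sqrt(n_features) under "auto":

    import numpy as np

    # features visited if the loop breaks once visited_features >=
    # sqrt(n_features), i.e. ceil(sqrt(n)) of the n features
    for n in (2, 3, 4, 9, 16):
        print(n, int(np.ceil(np.sqrt(n))))
    # prints: 2 2 / 3 2 / 4 2 / 9 3 / 16 4

so n_features=2 is the only case here where every feature is always
tested.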

I'm making notes of the things that weren't clear; I'll probably tidy
them into a docs bug report with suggested new wording.

Cheers,
i.

On 7 July 2013 19:49, Peter Prettenhofer <peter.prettenho...@gmail.com> wrote:
> Hi Ian,
>
>
> 2013/7/7 Ian Ozsvald <i...@ianozsvald.com>
>>
>> Hi all. I'm following the RandomForest code (in dev from a 1 week old
>> checkout). As I understand it (and similar to the previous post - I
>> have some RF usage experience but nothing fundamental), RF uses a
>> weighted sample of examples to learn *and* a random subset of features
>> when building its decision trees.
>
>
> correct - although weighted samples are optional - usually, RF takes a
> bootstrap sample and this is implemented via sample_weights (e.g. a sample
> that is picked twice for the bootstrap has weight 2.0)
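
For my own notes, my understanding of that trick in numpy terms - a
sketch, not the actual forest.py code:

    import numpy as np

    n_samples = 5
    rng = np.random.RandomState(0)
    # draw a bootstrap sample: n_samples indices picked with replacement
    indices = rng.randint(0, n_samples, n_samples)
    # encode it as per-sample weights instead of duplicating rows:
    # a row drawn twice gets weight 2.0, an undrawn row gets 0.0
    sample_weight = np.bincount(indices, minlength=n_samples).astype(np.float64)
    print(sample_weight)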
>>
>>
>> Does the scikit-learn implementation use a random subset of features?
>> I've followed the code in forest.py and I can't find where the choice
>> might be made. I haven't looked at the C code for the DecisionTree.
>
>
> It's in the implementation of DecisionTree - see sklearn/tree/_tree.pyx -
> look for the for loop over ``features``.
>
>>
>>
>> I'm interested to learn the lower bound of the number of random
>> features that can be chosen.
>
>
> could you elaborate on that?
>
>>
>>
>> I'm also curious to understand where we can restrict the depth of the
>> RandomForest classifier. All I can see is that in forest.py the
>> constructor takes but ignores the max_depth argument:
>> class RandomForestClassifier(ForestClassifier):
>> ...
>>     def __init__(self,
>>                  n_estimators=10,
>>                  criterion="gini",
>>                  max_depth=None,
>> ...
>>         super(RandomForestClassifier, self).__init__(
>>             base_estimator=DecisionTreeClassifier(),
>> ...
>>
>> base.py._make_estimator just clones the existing base_estimator. Am I
>> missing something?
>
>
> after cloning it calls ``set_params`` with ``estimator_params`` -
> ``'max_depth'`` is one of those.
>
> best,
>  Peter
>
>>
>>
>> Thanks for listening,
>> Ian.
>>
>
> --
> Peter Prettenhofer
>



-- 
Ian Ozsvald (A.I. researcher)
i...@ianozsvald.com

http://IanOzsvald.com
http://MorConsulting.com/
http://Annotate.IO
http://SocialTiesApp.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald
http://ShowMeDo.com
