Some more questions.

Is it possible to know which features are selected for building a tree?
According to the documentation [1], max_features sets the number of features
that are randomly selected, but it is still not clear which features are used
in building a single tree. Or can feature_importances_ be used to check which
features were selected when building a tree, by looking at the features whose
importance is larger than 0? My data has around 200 columns, and sqrt(200) is
around 14, but the number of features with an importance larger than 0 does
not match that.
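
One way to check this directly is sketched below. It relies on the fact that
max_features is documented as the number of features considered at each split,
not per tree, so a single tree can end up splitting on more than sqrt(200)
distinct features. The synthetic data set and the helper name features_used
are assumptions made for illustration; only estimators_ and tree_.feature come
from scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a 200-column data set (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=200, n_informative=20,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_features="sqrt",
                                random_state=0).fit(X, y)


def features_used(estimator):
    """Return the sorted indices of the features a fitted tree splits on."""
    split_features = estimator.tree_.feature
    # Leaf nodes carry a negative placeholder, so keep only real split nodes.
    return np.unique(split_features[split_features >= 0])


for i, estimator in enumerate(forest.estimators_):
    used = features_used(estimator)
    print("tree %d splits on %d distinct features" % (i, used.size))
```

The count printed for each tree is the number of columns that tree actually
used, which is the quantity to compare against max_features.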

[2] explains that all trees are equal and that there is no tree weighting in a
random forest. Can I say that this is why the result of predict() is obtained
through a majority vote: all trees are equal, so no tree's vote is more
important than the others'? And does the same apply to predict_proba(), with
the forest's output being the mean of all the trees' probabilities?
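
A small check of the averaging behaviour, as a sketch on synthetic data (the
data set and variable names are assumptions). In current scikit-learn releases
the forest's predict_proba() is the unweighted mean of the per-tree
probabilities, and predict() returns the class with the highest averaged
probability rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Unweighted mean of the class probabilities predicted by each tree.
mean_proba = np.mean([tree.predict_proba(X) for tree in forest.estimators_],
                     axis=0)

# Class with the highest averaged probability for each sample.
manual_pred = forest.classes_[np.argmax(mean_proba, axis=1)]

print(np.allclose(mean_proba, forest.predict_proba(X)))
print(np.array_equal(manual_pred, forest.predict(X)))
```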

Thanks again for your help.

[1]. 
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.max_features

[2]. 
http://stackoverflow.com/questions/17057139/how-to-find-key-trees-features-from-a-trained-random-forest

----- Original message -----
From: Gilles Louppe <[email protected]>
To: Aaron Jacques <[email protected]>;
"[email protected]"
<[email protected]>
Cc:
Sent: Wednesday, 28 August 2013, 7:10
Subject: Re: [Scikit-learn-general] sample_weight and features in a single tree

Hi Aaron,

Assume that X is your data and y is the labels for X. If classes in y
are not balanced and you want to fix that, you can indeed use sample
weights to simulate class weights. Basically you can simply do:

forest.fit(X, y, sample_weight=balance_weights(y))
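
A self-contained sketch of the same idea follows. It uses
compute_sample_weight from sklearn.utils.class_weight in place of
balance_weights, which is a swapped-in helper on my part since
balance_weights may not be available in recent releases; the imbalanced
synthetic data is also an assumption. The point is that sample_weight is a
flat array with one weight per row, derived from y and not from the columns
of X.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Imbalanced two-class problem: roughly 90% / 10% (illustrative data).
X, y = make_classification(n_samples=300, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

# One weight per sample, inversely proportional to the frequency of its class.
sample_weight = compute_sample_weight("balanced", y)

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X, y, sample_weight=sample_weight)

# After weighting, each class carries the same total weight.
totals = [sample_weight[y == c].sum() for c in np.unique(y)]
print(sample_weight.shape, totals)
```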

> In addition, how can I know what features are used for each tree 
> (RandomForestClassifier.estimators_)? Or does RandomForestClassifier use all 
> features for each tree? For example, given a DataFrame with features f=[Age, 
> Job, Title, ...], will each tree use all features in f when calling fit()? Or 
> is there any way to know which features are used for a single tree?

Both random forests and single decision trees are built on *all* the
features that you provide in X.

If you want to know which ones were the most helpful/important to
build the forest, then you can check the `feature_importances_`
attribute which will give you a score for each feature (the higher,
the more important).
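
As a minimal sketch of reading that attribute (the synthetic data and the
feature names f0..f7 are hypothetical placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical column names for an 8-feature synthetic data set.
feature_names = ["f%d" % i for i in range(8)]
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One score per feature; sort from most to least important.
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]

for idx in order:
    print("%-4s %.3f" % (feature_names[idx], importances[idx]))
```

The scores are normalized, so they are best read as relative rankings within
one fitted forest.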

Hope this helps,

Gilles

On 28 August 2013 12:41, Aaron Jacques <[email protected]> wrote:
>
>
> A thread on SO [1] states that class weighting for a random forest can be 
> achieved through the sample_weight argument of fit(). If I have a 
> two-dimensional dataset of the form
>
>
>          categorical_1   numeric   categorical_2   ...
> row 1    string_a        182       string_x        ...
> row 2    string_b        12        string_y        ...
> row 3    string_a        3342      string_z        ...
> ...
>
> How can I pass in sample_weight as class weights in such a case? Passing in 
> a multi-dimensional sample_weight leads to the following error:
>   preprocessing.balance_weights([[1,2,3,4,5][1,2,3,4,4]])
>
>   TypeError: list indices must be integers, not tuple
>
>
> Or should I pass in a format like [string_a, string_b, string_a, 182, 12, 
> 3342, string_x ...], with all classes as a flat list where string_a is the 
> factor of all classes? Or what is the right way to do that? Or can I just 
> pass in a weight for a single tree?
>
> In addition, how can I know what features are used for each tree 
> (RandomForestClassifier.estimators_)? Or does RandomForestClassifier use all 
> features for each tree? For example, given a DataFrame with features f=[Age, 
> Job, Title, ...], will each tree use all features in f when calling fit()? Or 
> is there any way to know which features are used for a single tree?
>
> Thanks
>
> [1]. 
> http://stackoverflow.com/questions/17688147/how-to-weight-classes-in-a-randomforest-implementation
>
> ------------------------------------------------------------------------------
> Learn the latest--Visual Studio 2012, SharePoint 2013, SQL 2012, more!
> Discover the easy way to master current and previous Microsoft technologies
> and advance your career. Get an incredible 1,500+ hours of step-by-step
> tutorial videos with LearnDevNow. Subscribe today and save!
> http://pubads.g.doubleclick.net/gampad/clk?id=58040911&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
