Hi All,
This is my understanding of the Random Forest algorithm:
The Random Forest algorithm builds a number of trees, each from a randomly
selected subset of the samples and features. At each node of a tree it uses
the decrease in Gini impurity to find the best feature/threshold pair
(several thresholds are tested for each feature), i.e. the split that best
separates the positive class from the negative class.
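As I understand it, the pieces above map onto the scikit-learn estimator
roughly like this (a sketch on toy data; the parameter values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy two-class data as a stand-in for the real problem.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# bootstrap=True resamples the rows for each tree; max_features limits the
# candidate features tried at each split; criterion="gini" is the impurity
# used to pick the best feature/threshold pair.
clf = RandomForestClassifier(
    n_estimators=50, bootstrap=True, max_features="sqrt",
    criterion="gini", random_state=0,
)
clf.fit(X, y)
print(len(clf.estimators_))  # one fitted tree per estimator: 50
```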
Question 1:
    I have a two-class classification problem where the positive samples
reside in clusters. A traditional cross-validation approach is not aware of
this and may split the points of one cluster between the training and the
test set, giving rise to optimistically strong classification performance.
I wrote a custom cross-validation loop to address this. However, the
bootstrapping inside the Random Forest algorithm also randomly selects
samples and features and controls for overfitting.
    When fit is applied to the randomly selected samples, does the algorithm
do an internal cross-validation to prevent overfitting? I did not find this
in the GitHub code.
    If yes, can I specify my groupings to the Random Forest?
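Concretely, the group-aware splitting I have in mind looks roughly like the
sketch below, using GroupKFold from scikit-learn (the cluster labels here
are made up; in my data they would come from the actual clusters):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=120, random_state=0)

# Hypothetical cluster ids: samples sharing a group id are never split
# between the training and the test side of a fold.
groups = np.repeat(np.arange(30), 4)

cv = GroupKFold(n_splits=5)
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=cv, groups=groups)
print(len(scores))  # one accuracy score per fold: 5
```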
Question 2:
    The Gini impurity at each node tries to find the best separation between
the two classes. I care more about obtaining a clean separation for my
positive class. Is there any way to give more importance to one class during
the partitioning?
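Is something like the class_weight parameter of RandomForestClassifier what
I should be using here? My understanding is that it scales each class's
contribution to the impurity computation, so up-weighting the positive
class would bias the splits toward isolating it (the weight 5 below is an
arbitrary value for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced toy data: class 1 plays the role of my positive class.
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# class_weight reweights the samples used in the Gini computation, so
# splits that misplace positive samples become more costly.
clf = RandomForestClassifier(class_weight={0: 1, 1: 5}, random_state=0)
clf.fit(X, y)
print(clf.n_classes_)  # 2
```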
Thanks in advance.
Mamun
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general