Hi Guillaume,

With respect to the following point you mentioned: "You can visualize the trees with sklearn.tree.export_graphviz: http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html"

I couldn't find a direct method for exporting the RandomForestClassifier trees. Accordingly, I attempted a workaround using the following code, but still with no success:

    clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
    clf.fit(p_features_train, p_labels_train)

    for i, tree in enumerate(clf.estimators_):
        with open('tree_' + str(i) + '.dot', 'w') as dotfile:
            tree.export_graphviz(clf, dotfile)

Would you please be able to help me with the piece of code I need to execute for exporting the RandomForestClassifier trees?

Cheers,
Debu
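For reference, a minimal sketch of the kind of export loop Guillaume's suggestion points at, untested in this thread; toy data from make_classification stands in for p_features_train / p_labels_train, and the small n_estimators and file names are placeholders, not recommendations from the thread. The two differences from the snippet above are that export_graphviz is a module-level function in sklearn.tree (not a method on the tree object), and that it must be given one fitted tree from clf.estimators_ at a time rather than the whole forest:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import export_graphviz

    # Stand-in data; replace with the real p_features_train / p_labels_train.
    X_train, y_train = make_classification(n_samples=1000, n_features=20,
                                           random_state=0)

    # A small forest keeps the number of .dot files manageable;
    # with n_estimators=5000 this loop would write 5000 files.
    clf = RandomForestClassifier(n_estimators=10, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)

    # Each element of clf.estimators_ is a fitted DecisionTreeClassifier,
    # which is what export_graphviz expects as its first argument.
    for i, estimator in enumerate(clf.estimators_):
        with open('tree_{}.dot'.format(i), 'w') as dotfile:
            export_graphviz(estimator, out_file=dotfile)

The resulting .dot files can then be rendered with Graphviz, e.g. dot -Tpng tree_0.dot -o tree_0.png.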
On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <g.lemaitr...@gmail.com> wrote:

> On 27 December 2016 at 18:17, Debabrata Ghosh <mailford...@gmail.com> wrote:
>
>> Dear Joel, Andrew and Roman,
>>
>> Thank you very much for your individual feedback! It's very helpful indeed. A few more points related to my model execution:
>>
>> 1. By the term "scoring" I meant the process of executing the model once again without retraining it. So, for training the model I used RandomForestClassifier, and for my scoring (execution without retraining) I have used joblib.dump and joblib.load.
>
> Go with the terms training, validating, and testing; this is pretty much standard. Scoring is just the value of a metric given some data (training data, validation data, or testing data).
>
>> 2. I have used the parameter n_estimators = 5000 while training my model. Besides that, I have used n_jobs = -1 and haven't used any other parameter.
>
> You should probably check those other parameters and understand what their effects are. You should really check the link from Roman, since GridSearchCV can help you decide how to set the parameters:
> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
> Additionally, 5000 trees seems like a lot to me.
>
>> 3. For my "scoring" activity (executing the model without retraining it), is there an alternate approach to the joblib library?
>
> Joblib only stores data. It has no link with scoring (check Roman's answer).
>
>> 4. When I execute my scoring job (joblib method) on a dataset which is completely different from my training dataset, then I get a similar True Positive Rate and False Positive Rate as in training.
>
> That is what you should get.
>
>> 5. However, when I execute my scoring job on the same dataset used for training my model, then I get a very high TPR and FPR.
>
> You are testing on data which you used while training. One of the first rules is not to do that. If you want to evaluate your classifier in some way, have a separate set (test set) and only test on that one. As previously mentioned by Roman, 80% of your data are already known by the RandomForestClassifier and will be perfectly classified.
>
>> Is there a mechanism through which I can visualise the trees created by my RandomForestClassifier algorithm? While I dumped the model using joblib.dump, there are a bunch of .npy files created. Will those contain the trees?
>
> You can visualize the trees with sklearn.tree.export_graphviz:
> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>
> The bunch of .npy files are the data needed to load the RandomForestClassifier which you previously dumped.
>
>> Thanks in advance!
>>
>> Cheers,
>> Debu
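On the GridSearchCV suggestion above, a minimal sketch of what a small parameter search over a RandomForestClassifier could look like; the toy data, the parameter grid and the F1 scoring choice are illustrative assumptions, not values from the thread:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Toy imbalanced data as a stand-in for the real 600K records.
    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Search a few parameters that control the size/regularisation of the trees.
    param_grid = {
        'n_estimators': [100, 300],
        'max_depth': [None, 10, 20],
        'min_samples_leaf': [1, 5, 10],
    }
    search = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=0),
                          param_grid, scoring='f1', cv=5)
    search.fit(X_train, y_train)

    print(search.best_params_)
    # Final check on data the search never saw.
    print(search.score(X_test, y_test))

With imbalanced classes, scoring the search on F1 (or a similar precision/recall-based metric) is usually more informative than recall alone.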
>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.noth...@gmail.com> wrote:
>>
>>> Your model is overfit to the training data. That is not to say that it's necessarily possible to get a better fit. The default settings for trees lean towards a tight fit, so you might modify their parameters to increase regularisation. Still, you should not expect that evaluating a model's performance on its training data will be indicative of its general performance. This is why we use held-out test sets and cross-validation.
>>>
>>> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurc...@gmail.com> wrote:
>>>
>>>> Hi Debu,
>>>>
>>>> On 27/12/16 08:18, Andrew Howe wrote:
>>>> > 5. I got a prediction result with a True Positive Rate (TPR) of 10-12% on probability thresholds above 0.5.
>>>>
>>>> Getting a high True Positive Rate (recall) is not a sufficient condition for a well-behaved model. Though 0.1 recall is still pretty bad. You could look at the precision at the same time (or consider, for instance, the F1 score).
>>>>
>>>> > 7. I reloaded the model in a different Python instance from the pickle file mentioned above and did my scoring, i.e., used the joblib load method and then ran prediction (the predict_proba method) on the entire set of my original 600K records. Another question: is there an alternate model scoring library (apart from joblib, the one I am using)?
>>>>
>>>> Joblib is not a scoring library; once you load a model from disk with joblib you should get ~ the same RandomForestClassifier estimator object as before saving it.
>>>>
>>>> > 8. Now when I am running (scoring) my model loaded via joblib, using predict_proba on the entire set of original data (600K), I am getting a True Positive Rate of around 80%.
>>>>
>>>> That sounds normal, considering what you are doing. Your entire set consists of 80% training set (for which the recall, I imagine, would be close to 1.0) and 20% test set (with a recall of 0.1), so on average you would get a recall close to 0.8 for the complete set (0.8 x 1.0 + 0.2 x 0.1 = 0.82). Unless I missed something.
>>>>
>>>> > 9. I did some further analysis and figured out that during the training process, when the model was predicting on the test sample of 120K, it could only predict 10-12% of the 120K data beyond a probability threshold of 0.5. When I am now trying to score my model on the entire set of 600K records, it appears that the model is remembering some of its past behaviour and data and accordingly throwing an 80% True Positive Rate.
>>>>
>>>> It feels like your RandomForestClassifier is not properly tuned. A recall of 0.1 on the test set is quite low. It could be worth trying to tune it better (cf. https://stackoverflow.com/a/36109706), using some other metric than the recall to evaluate the performance.
>>>>
>>>> Roman
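Pulling Joel's and Roman's points together, a minimal sketch of the evaluation workflow being described: toy data from make_classification stands in for the real 600K records, rf_model.pkl is a placeholder file name, and the standalone joblib package is assumed (the thread may have used sklearn.externals.joblib, which behaves the same for dump/load). The model is fit on the training split only, precision/recall/F1 are reported on the held-out test split, and the joblib round trip is shown to be pure persistence:

    import joblib
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Toy imbalanced data as a stand-in for the real dataset.
    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)

    # Precision, recall and F1, reported on the held-out test split only.
    print(classification_report(y_test, clf.predict(X_test)))

    # joblib persists the fitted estimator; the reloaded model produces
    # identical probabilities, so loading it is not a separate "scoring" step.
    joblib.dump(clf, 'rf_model.pkl')
    clf_loaded = joblib.load('rf_model.pkl')
    assert np.array_equal(clf.predict_proba(X_test),
                          clf_loaded.predict_proba(X_test))

The test-split numbers are the ones that indicate generalisation; evaluating on the full dataset, 80% of which was used for training, mixes the two and gives exactly the ~0.8 recall effect Roman describes.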
> --
> Guillaume Lemaitre
> INRIA Saclay - Ile-de-France
> Equipe PARIETAL
> guillaume.lemaitre@inria.fr
> https://glemaitre.github.io/
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn