The name ‘tree’ is clashing: "from sklearn import tree" binds it to the sklearn.tree module, and the loop variable "tree" then rebinds it to each DecisionTreeClassifier in clf.estimators_. That is why export_graphviz is undefined in your first attempt (it was never imported into your namespace) and why tree.export_graphviz raises an AttributeError in your second (a DecisionTreeClassifier has no such method).
You can change the import to "from sklearn.tree import export_graphviz" and call the function directly, or keep the module import and rename the loop variable.
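For example, renaming the loop variable so nothing is shadowed, something like (reusing p_features_train / p_labels_train from your snippet):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import export_graphviz

    clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
    clf.fit(p_features_train, p_labels_train)  # your training arrays

    # 'estimator' does not shadow the sklearn.tree module, and
    # export_graphviz is imported directly, so the call resolves
    for idx_tree, estimator in enumerate(clf.estimators_):
        export_graphviz(estimator, out_file='{}.dot'.format(idx_tree))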
On Wed, Dec 28, 2016 at 8:38 PM, Debabrata Ghosh <mailford...@gmail.com> wrote:

> Hi Guillaume,
> Thanks for your feedback! I am still getting an error while attempting to print the trees. Here is a snapshot of my code. I know I may be missing something very silly, but still wanted to check and see how this works.
>
> >>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
> >>> clf.fit(p_features_train,p_labels_train)
> RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
>             max_depth=None, max_features='auto', max_leaf_nodes=None,
>             min_samples_leaf=1, min_samples_split=2,
>             min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1,
>             oob_score=False, random_state=None, verbose=0,
>             warm_start=False)
> >>> for idx_tree, tree in enumerate(clf.estimators_):
> ...     export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
> NameError: name 'export_graphviz' is not defined
> >>> for idx_tree, tree in enumerate(clf.estimators_):
> ...     tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
> AttributeError: 'DecisionTreeClassifier' object has no attribute 'export_graphviz'
>
> Just to give you some background, I have imported the following libraries:
>
> from sklearn.ensemble import RandomForestClassifier
> from sklearn import tree
>
> Thanks again, as always!
>
> Cheers,
>
> On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lemaître <g.lemaitr...@gmail.com> wrote:
>
>> After the fit you need this call:
>>
>> for idx_tree, tree in enumerate(clf.estimators_):
>>     export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>>
>> On 28 December 2016 at 20:25, Debabrata Ghosh <mailford...@gmail.com> wrote:
>>
>>> Hi Guillaume,
>>> With respect to the following point you mentioned:
>>> You can visualize the trees with sklearn.tree.export_graphviz:
>>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>>
>>> I couldn't find a direct method for exporting the RandomForestClassifier trees. Accordingly, I attempted a workaround using the following code, but still with no success:
>>>
>>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>>> clf.fit(p_features_train,p_labels_train)
>>> for i, tree in enumerate(clf.estimators_):
>>>     with open('tree_' + str(i) + '.dot', 'w') as dotfile:
>>>         tree.export_graphviz(clf, dotfile)
>>>
>>> Would you please be able to help me with the piece of code which I need to execute for exporting the RandomForestClassifier trees?
>>>
>>> Cheers,
>>>
>>> Debu
>>>
>>> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <g.lemaitr...@gmail.com> wrote:
>>>
>>>> On 27 December 2016 at 18:17, Debabrata Ghosh <mailford...@gmail.com> wrote:
>>>>
>>>>> Dear Joel, Andrew and Roman,
>>>>> Thank you very much for your individual feedback! It's very helpful indeed! A few more points related to my model execution:
>>>>>
>>>>> 1. By the term "scoring" I meant the process of executing the model once again without retraining it. So, for training the model I used the RandomForestClassifier library, and for my scoring (execution without retraining) I have used joblib.dump and joblib.load.
>>>>
>>>> You should probably go with the terms training, validating, and testing; this is pretty much standard. Scoring is just the value of a metric given some data (training data, validation data, or testing data).
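>>>>
>>>> As a small illustration (synthetic data, not yours), "scoring" is just computing a metric on whichever split you pass in:
>>>>
>>>> from sklearn.datasets import make_classification
>>>> from sklearn.ensemble import RandomForestClassifier
>>>> from sklearn.model_selection import train_test_split
>>>>
>>>> X, y = make_classification(n_samples=1000, random_state=0)
>>>> X_train, X_test, y_train, y_test = train_test_split(
>>>>     X, y, test_size=0.2, random_state=0)
>>>> clf = RandomForestClassifier(n_estimators=100, random_state=0)
>>>> clf.fit(X_train, y_train)
>>>> # same model, two different "scores": one per dataset
>>>> print(clf.score(X_train, y_train))  # accuracy on training data (optimistic)
>>>> print(clf.score(X_test, y_test))    # accuracy on held-out test data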
>>>>>
>>>>> 2. I have used the parameter n_estimators=5000 while training my model. Besides it, I have used n_jobs=-1 and haven't used any other parameters.
>>>>
>>>> You should probably check those other parameters and understand what their effects are. You should really check the link from Roman, since GridSearchCV can help you decide how to set the parameters:
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>>>> Additionally, 5000 trees seems a lot to me.
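>>>>
>>>> A rough sketch of such a search (the grid values below are only an illustration, not a recommendation):
>>>>
>>>> from sklearn.ensemble import RandomForestClassifier
>>>> from sklearn.model_selection import GridSearchCV
>>>>
>>>> param_grid = {
>>>>     'n_estimators': [100, 500, 1000],
>>>>     'max_depth': [None, 5, 10],
>>>>     'min_samples_leaf': [1, 5, 10],
>>>> }
>>>> # 5-fold cross-validated search over the grid
>>>> search = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid, cv=5)
>>>> search.fit(p_features_train, p_labels_train)  # your training arrays
>>>> print(search.best_params_)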
>>>>>
>>>>> 3. For my "scoring" activity (executing the model without retraining it), is there an alternate approach to the joblib library?
>>>>
>>>> Joblib only stores data; there is no link with scoring (check Roman's answer).
>>>>
>>>>> 4. When I execute my scoring job (the joblib method) on a dataset which is completely different from my training dataset, I get a True Positive Rate and False Positive Rate similar to training.
>>>>
>>>> That is what you should get.
>>>>
>>>>> 5. However, when I execute my scoring job on the same dataset used for training my model, I get a very high TPR and FPR.
>>>>
>>>> You are testing on data which you used while training. Probably one of the first rules is to not do that. If you want to evaluate your classifier in some way, have a separate set (a test set) and only test on that one. As previously mentioned by Roman, 80% of your data are already known by the RandomForestClassifier and will be perfectly classified.
>>>>
>>>>> Is there a mechanism through which I can visualise the trees created by my RandomForestClassifier algorithm? While I dumped the model using joblib.dump, there are a bunch of .npy files created. Will those contain the trees?
>>>>
>>>> You can visualize the trees with sklearn.tree.export_graphviz:
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>>>
>>>> The bunch of .npy files are the data needed to load the RandomForestClassifier which you previously dumped.
>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Debu
>>>>>
>>>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>>>
>>>>>> Your model is overfit to the training data. That is not to say it's necessarily possible to get a better fit. The default settings for trees lean towards a tight fit, so you might modify their parameters to increase regularisation. Still, you should not expect that evaluating a model's performance on its training data will be indicative of its general performance. This is why we use held-out test sets and cross-validation.
>>>>>>
>>>>>> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurc...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Debu,
>>>>>>>
>>>>>>> On 27/12/16 08:18, Andrew Howe wrote:
>>>>>>> > 5. I got a prediction result with True Positive Rate (TPR) of 10-12% on probability thresholds above 0.5
>>>>>>>
>>>>>>> Getting a high True Positive Rate (recall) is not a sufficient condition for a well-behaved model, though 0.1 recall is still pretty bad. You could look at the precision at the same time (or consider, for instance, the F1 score).
>>>>>>>
>>>>>>> > 7. I reloaded the model in a different python instance from the pickle file mentioned above and did my scoring, i.e., used the joblib library's load method and then instantiated prediction (the predict_proba method) on the entire set of my original 600K records.
>>>>>>> > Another question – is there an alternate model scoring library (apart from joblib, the one I am using)?
>>>>>>>
>>>>>>> Joblib is not a scoring library; once you load a model from disk with joblib you should get ~ the same RandomForestClassifier estimator object as before saving it.
>>>>>>>
>>>>>>> > 8. Now when I am running (scoring) my model using joblib.predict_proba on the entire set of original data (600K), I am getting a True Positive rate of around 80%.
>>>>>>>
>>>>>>> That sounds normal, considering what you are doing. Your entire set consists of 80% training data (for which the recall, I imagine, would be close to 1.0) and 20% test data (with a recall of 0.1), so on average you would get a recall close to 0.8 for the complete set (0.8 × 1.0 + 0.2 × 0.1 ≈ 0.82). Unless I missed something.
>>>>>>>
>>>>>>> > 9. I did some further analysis and figured out that during the training process, when the model was predicting on the test sample of 120K it could only predict 10-12% of the 120K data beyond a probability threshold of 0.5. When I am now trying to score my model on the entire set of 600K records, it appears that the model is remembering some of its past behavior and data and accordingly throwing an 80% True Positive rate.
>>>>>>>
>>>>>>> It feels like your RandomForestClassifier is not properly tuned. A recall of 0.1 on the test set is quite low. It could be worth trying to tune it better (cf. https://stackoverflow.com/a/36109706 ), using some other metric than the recall to evaluate the performance.
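>>>>>>>
>>>>>>> A minimal sketch of checking precision, recall and F1 together (assuming the model was dumped to a hypothetical 'model.pkl', and X_test / y_test are your held-out 20% split):
>>>>>>>
>>>>>>> from sklearn.externals import joblib
>>>>>>> from sklearn.metrics import classification_report
>>>>>>>
>>>>>>> clf = joblib.load('model.pkl')  # hypothetical filename
>>>>>>> y_pred = clf.predict(X_test)    # your held-out features
>>>>>>> # precision, recall and F1 per class, not just the recall
>>>>>>> print(classification_report(y_test, y_pred))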
>>>>>>>
>>>>>>> Roman
>>>>
>>>> --
>>>> Guillaume Lemaitre
>>>> INRIA Saclay - Ile-de-France
>>>> Equipe PARIETAL
>>>> guillaume.lemai...@inria.fr --- https://glemaitre.github.io/
>>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Ile-de-France
>> Equipe PARIETAL
>> guillaume.lemai...@inria.fr --- https://glemaitre.github.io/
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn