Thanks Naoya! This has worked and I am able to generate the .dot files.
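For the archive, the working version boils down to something like the
sketch below. It uses toy data from make_classification in place of the
thread's p_features_train / p_labels_train, and a small n_estimators for
speed:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import export_graphviz  # not `from sklearn import tree`

    # Toy data standing in for p_features_train / p_labels_train.
    X, y = make_classification(n_samples=1000, random_state=0)

    clf = RandomForestClassifier(n_estimators=10, n_jobs=-1)
    clf.fit(X, y)

    # With export_graphviz imported directly, the loop variable name no
    # longer matters; it is renamed here anyway so nothing can shadow a
    # `tree` module import elsewhere.
    for idx_tree, estimator in enumerate(clf.estimators_):
        export_graphviz(estimator, out_file='{}.dot'.format(idx_tree))

If Graphviz is installed, each .dot file can then be rendered with, e.g.,
`dot -Tpng 0.dot -o 0.png`.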
Cheers,
Debu

On Thu, Dec 29, 2016 at 10:20 AM, Naoya Kanai <[email protected]> wrote:

> The ‘tree’ name is clashing between the sklearn.tree module and the
> DecisionTreeClassifier objects in the loop.
>
> You can change the import to
>
>     from sklearn.tree import export_graphviz
>
> and modify the method call accordingly.
>
> On Wed, Dec 28, 2016 at 8:38 PM, Debabrata Ghosh <[email protected]> wrote:
>
>> Hi Guillaume,
>> Thanks for your feedback! I am still getting an error while
>> attempting to print the trees. Here is a snapshot of my code. I know
>> I may be missing something very silly, but I still wanted to check
>> and see how this works.
>>
>> >>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>> >>> clf.fit(p_features_train, p_labels_train)
>> RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
>>             max_depth=None, max_features='auto', max_leaf_nodes=None,
>>             min_samples_leaf=1, min_samples_split=2,
>>             min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1,
>>             oob_score=False, random_state=None, verbose=0,
>>             warm_start=False)
>> >>> for idx_tree, tree in enumerate(clf.estimators_):
>> ...     export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>> ...
>> Traceback (most recent call last):
>>   File "<stdin>", line 2, in <module>
>> NameError: name 'export_graphviz' is not defined
>> >>> for idx_tree, tree in enumerate(clf.estimators_):
>> ...     tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>> ...
>> Traceback (most recent call last):
>>   File "<stdin>", line 2, in <module>
>> AttributeError: 'DecisionTreeClassifier' object has no attribute
>> 'export_graphviz'
>>
>> Just to give you a background about the libraries, I have imported
>> the following:
>>
>> from sklearn.ensemble import RandomForestClassifier
>> from sklearn import tree
>>
>> Thanks again as always!
>>
>> Cheers,
>>
>> On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lemaître <[email protected]> wrote:
>>
>>> After the fit you need this call:
>>>
>>> for idx_tree, tree in enumerate(clf.estimators_):
>>>     export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>>>
>>> On 28 December 2016 at 20:25, Debabrata Ghosh <[email protected]> wrote:
>>>
>>>> Hi Guillaume,
>>>> With respect to the following point you mentioned:
>>>>
>>>> You can visualize the trees with sklearn.tree.export_graphviz:
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>>>
>>>> I couldn't find a direct method for exporting the
>>>> RandomForestClassifier trees. Accordingly, I attempted a workaround
>>>> using the following code, but still no success:
>>>>
>>>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>>>> clf.fit(p_features_train, p_labels_train)
>>>> for i, tree in enumerate(clf.estimators_):
>>>>     with open('tree_' + str(i) + '.dot', 'w') as dotfile:
>>>>         tree.export_graphviz(clf, dotfile)
>>>>
>>>> Would you please be able to help me with the piece of code I need
>>>> to execute for exporting the RandomForestClassifier trees?
>>>>
>>>> Cheers,
>>>>
>>>> Debu
>>>>
>>>> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître <[email protected]> wrote:
>>>>
>>>>> On 27 December 2016 at 18:17, Debabrata Ghosh <[email protected]> wrote:
>>>>>
>>>>>> Dear Joel, Andrew and Roman,
>>>>>> Thank you very much for your individual feedback! It's very
>>>>>> helpful indeed! A few more points related to my model execution:
>>>>>> 1. By the term "scoring" I meant the process of executing the
>>>>>>    model once again without retraining it. So, for training the
>>>>>>    model I used the RandomForestClassifier library, and for my
>>>>>>    scoring (execution without retraining) I have used joblib.dump
>>>>>>    and joblib.load.
>>>>>
>>>>> Probably go with the terms training, validating, and testing; this
>>>>> is pretty much standard. Scoring is just the value of a metric
>>>>> given some data (training data, validation data, or testing data).
>>>>>
>>>>>> 2. I have used the parameter n_estimators=5000 while training my
>>>>>>    model. Besides it, I have used n_jobs=-1 and haven't used any
>>>>>>    other parameter.
>>>>>
>>>>> You should probably check the other parameters and understand what
>>>>> their effects are. You should really check the link from Roman,
>>>>> since GridSearchCV can help you decide how to set the parameters:
>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>>>>> Additionally, 5000 trees seems a lot to me.
>>>>>
>>>>>> 3. For my "scoring" activity (executing the model without
>>>>>>    retraining it), is there an alternative approach to the joblib
>>>>>>    library?
>>>>>
>>>>> Joblib only stores data. It has no link with scoring (check
>>>>> Roman's answer).
>>>>>
>>>>>> 4. When I execute my scoring job (joblib method) on a dataset
>>>>>>    which is completely different from my training dataset, I get
>>>>>>    a similar True Positive Rate and False Positive Rate as in
>>>>>>    training.
>>>>>
>>>>> That is what you should get.
>>>>>
>>>>>> 5. However, when I execute my scoring job on the same dataset
>>>>>>    used for training my model, I get a very high TPR and FPR.
>>>>>
>>>>> You are testing on data which you used while training. One of the
>>>>> first rules is not to do that. If you want to evaluate your
>>>>> classifier in some way, keep a separate set (the test set) and
>>>>> only test on that one. As previously mentioned by Roman, 80% of
>>>>> your data are already known by the RandomForestClassifier and will
>>>>> be perfectly classified.
>>>>>
>>>>>> Is there a mechanism through which I can visualise the trees
>>>>>> created by my RandomForestClassifier algorithm? While I dumped
>>>>>> the model using joblib.dump, there are a bunch of .npy files
>>>>>> created. Will those contain the trees?
>>>>>
>>>>> You can visualize the trees with sklearn.tree.export_graphviz:
>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>>>>
>>>>> The bunch of .npy files are the data needed to load the
>>>>> RandomForestClassifier which you previously dumped.
>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Debu
>>>>>>
>>>>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <[email protected]> wrote:
>>>>>>
>>>>>>> Your model is overfit to the training data. That is not to say
>>>>>>> that it's necessarily possible to get a better fit. The default
>>>>>>> settings for trees lean towards a tight fit, so you might modify
>>>>>>> their parameters to increase regularisation. Still, you should
>>>>>>> not expect that evaluating a model's performance on its training
>>>>>>> data will be indicative of its general performance. This is why
>>>>>>> we use held-out test sets and cross-validation.
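To make Joel's point concrete, here is a minimal sketch of evaluation on
a held-out test set, including the joblib dump/load round-trip (joblib
only persists the estimator; it does no scoring itself). The dataset,
the rf_model.pkl file name, and the parameters are all illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.externals import joblib  # thread-era import; a plain
                                          # `import joblib` also works
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split

    # Illustrative imbalanced data (~10% positives), not the thread's set.
    X, y = make_classification(n_samples=5000, weights=[0.9, 0.1],
                               random_state=0)

    # Hold out 20% that the model never sees during training.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1,
                                 random_state=0)
    clf.fit(X_train, y_train)

    # joblib only stores and restores the fitted estimator.
    joblib.dump(clf, 'rf_model.pkl')
    clf_loaded = joblib.load('rf_model.pkl')

    # Report recall on the held-out data only; computing it on X_train
    # would give the misleading near-1.0 figure discussed above.
    print(recall_score(y_test, clf_loaded.predict(X_test)))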
>>>>>>> On 27 December 2016 at 20:51, Roman Yurchak <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Debu,
>>>>>>>>
>>>>>>>> On 27/12/16 08:18, Andrew Howe wrote:
>>>>>>>> > 5. I got a prediction result with True Positive Rate (TPR)
>>>>>>>> > as 10-12% on probability thresholds above 0.5
>>>>>>>>
>>>>>>>> Getting a high True Positive Rate (recall) is not a sufficient
>>>>>>>> condition for a well-behaved model, though 0.1 recall is still
>>>>>>>> pretty bad. You could look at the precision at the same time
>>>>>>>> (or consider, for instance, the F1 score).
>>>>>>>>
>>>>>>>> > 7. I reloaded the model in a different python instance from
>>>>>>>> > the pickle file mentioned above and did my scoring, i.e.,
>>>>>>>> > used the joblib library load method and then instantiated
>>>>>>>> > prediction (predict_proba method) on the entire set of my
>>>>>>>> > original 600K records.
>>>>>>>> > Another question – is there an alternative model scoring
>>>>>>>> > library (apart from joblib, the one I am using)?
>>>>>>>>
>>>>>>>> Joblib is not a scoring library; once you load a model from
>>>>>>>> disk with joblib you should get approximately the same
>>>>>>>> RandomForestClassifier estimator object as before saving it.
>>>>>>>>
>>>>>>>> > 8. Now when I am running (scoring) my model using
>>>>>>>> > joblib.predict_proba on the entire set of original data
>>>>>>>> > (600K), I am getting a True Positive Rate of around 80%.
>>>>>>>>
>>>>>>>> That sounds normal, considering what you are doing. Your entire
>>>>>>>> set consists of the 80% training split (for which the recall, I
>>>>>>>> imagine, would be close to 1.0) and the 20% test split (with a
>>>>>>>> recall of 0.1), so on average you would get a recall close to
>>>>>>>> 0.8 for the complete set (0.8 × 1.0 + 0.2 × 0.1 = 0.82). Unless
>>>>>>>> I missed something.
>>>>>>>>
>>>>>>>> > 9. I did some further analysis and figured out that during
>>>>>>>> > the training process, when the model was predicting on the
>>>>>>>> > test sample of 120K it could only predict 10-12% of the 120K
>>>>>>>> > data beyond a probability threshold of 0.5. When I am now
>>>>>>>> > trying to score my model on the entire set of 600K records,
>>>>>>>> > it appears that the model is remembering some of its past
>>>>>>>> > behavior and data and accordingly throwing an 80% True
>>>>>>>> > Positive Rate.
>>>>>>>>
>>>>>>>> It feels like your RandomForestClassifier is not properly
>>>>>>>> tuned. A recall of 0.1 on the test set is quite low. It could
>>>>>>>> be worth trying to tune it better
>>>>>>>> (cf. https://stackoverflow.com/a/36109706 ), using some other
>>>>>>>> metric than the recall to evaluate the performance.
>>>>>>>>
>>>>>>>> Roman
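A sketch of the kind of tuning Roman suggests, scoring the grid search
with F1 instead of recall alone. The grid values here are illustrative,
not a recommendation:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Illustrative imbalanced data, as in the previous sketch.
    X, y = make_classification(n_samples=5000, weights=[0.9, 0.1],
                               random_state=0)

    # Far fewer than 5000 trees is usually enough; tune depth and leaf
    # size to control the fit instead.
    param_grid = {
        'n_estimators': [100, 300],
        'max_depth': [5, 10, None],
        'min_samples_leaf': [1, 5, 20],
    }

    search = GridSearchCV(
        RandomForestClassifier(random_state=0, n_jobs=-1),
        param_grid,
        scoring='f1',  # balances precision and recall, not recall alone
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)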
>>>>>
>>>>> --
>>>>> Guillaume Lemaitre
>>>>> INRIA Saclay - Ile-de-France
>>>>> Equipe PARIETAL
>>>>> [email protected] --- https://glemaitre.github.io/
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
