Dear Joel, Andrew and Roman,

Thank you very much for your individual feedback! It's very helpful indeed. A few more points related to my model execution:
1. By the term "scoring" I meant the process of executing the model once again without retraining it. So, for training the model I used the RandomForestClassifier estimator, and for my scoring (execution without retraining) I used joblib.dump and joblib.load (a minimal sketch of this flow is in the P.S. below).
2. I used the parameter n_estimators = 5000 while training my model. Besides that, I used n_jobs = -1 and did not set any other parameter.
3. For my "scoring" activity (executing the model without retraining it), is there an alternative approach to the joblib library?
4. When I execute my scoring job (the joblib-loaded model) on a dataset which is completely different from my training dataset, I get a True Positive Rate and False Positive Rate similar to those seen during training.
5. However, when I execute my scoring job on the same dataset used for training my model, I get a very high TPR and FPR.

Is there a mechanism through which I can visualise the trees created by the RandomForestClassifier algorithm? When I dumped the model using joblib.dump, a bunch of .npy files were created. Will those contain the trees?

Thanks in advance!

Cheers,
Debu
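P.S. For concreteness, the train / dump / load flow from points 1-3 looks roughly like the minimal sketch below; the toy data, file names and the plain-pickle alternative for point 3 are placeholders, not the actual setup.

import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib  # plain `import joblib` also works if joblib is installed separately

# placeholder training data standing in for the real dataset
X_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=1000)

# same parameters as described in point 2: 5000 trees, all cores
clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
clf.fit(X_train, y_train)

# persist the fitted model (the joblib.dump step; without compression some
# joblib versions write companion .npy files next to the main file)
joblib.dump(clf, 'rf_model.pkl')

# later, in a separate Python instance: load and score without retraining
clf_loaded = joblib.load('rf_model.pkl')
proba = clf_loaded.predict_proba(X_train)[:, 1]

# alternative to joblib (point 3): the standard pickle module also works;
# joblib is simply more efficient for the large numpy arrays inside the model
with open('rf_model_pickle.pkl', 'wb') as f:
    pickle.dump(clf, f)
with open('rf_model_pickle.pkl', 'rb') as f:
    clf_pickled = pickle.load(f)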
On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman <joel.noth...@gmail.com> wrote:

> Your model is overfit to the training data. Not to say that it's
> necessarily possible to get a better fit. The default settings for trees
> lean towards a tight fit, so you might modify their parameters to increase
> regularisation. Still, you should not expect that evaluating a model's
> performance on its training data will be indicative of its general
> performance. This is why we use held-out test sets and cross-validation.
>
> On 27 December 2016 at 20:51, Roman Yurchak <rth.yurc...@gmail.com> wrote:
>
>> Hi Debu,
>>
>> On 27/12/16 08:18, Andrew Howe wrote:
>> > 5. I got a prediction result with True Positive Rate (TPR) as 10-12
>> > % on probability thresholds above 0.5
>>
>> Getting a high True Positive Rate (recall) is not a sufficient condition
>> for a well behaved model. Though 0.1 recall is still pretty bad. You
>> could look at the precision at the same time (or consider, for instance,
>> the F1 score).
>>
>> > 7. I reloaded the model in a different python instance from the
>> > pickle file mentioned above and did my scoring, i.e., used
>> > joblib library load method and then instantiated prediction
>> > (predict_proba method) on the entire set of my original 600 K
>> > records
>> > Another question – is there an alternate model scoring
>> > library (apart from joblib, the one I am using)?
>>
>> Joblib is not a scoring library; once you load a model from disk with
>> joblib you should get ~ the same RandomForestClassifier estimator object
>> as before saving it.
>>
>> > 8. Now when I am running (scoring) my model using
>> > joblib.predict_proba on the entire set of original data (600 K),
>> > I am getting a True Positive rate of around 80%.
>>
>> That sounds normal, considering what you are doing. Your entire set
>> consists of 80% training set (for which the recall, I imagine, would
>> be close to 1.0) and 20% test set (with a recall of 0.1), so on
>> average you would get a recall close to 0.8 for the complete set. Unless
>> I missed something.
>>
>> > 9. I did some further analysis and figured out that during the
>> > training process, when the model was predicting on the test
>> > sample of 120K it could only predict 10-12% of 120K data beyond
>> > a probability threshold of 0.5. When I am now trying to score my
>> > model on the entire set of 600 K records, it appears that the
>> > model is remembering some of its past behavior and data and
>> > accordingly throwing 80% True positive rate
>>
>> It feels like your RandomForestClassifier is not properly tuned. A
>> recall of 0.1 on the test set is quite low. It could be worth trying to
>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some
>> other metric than the recall to evaluate the performance.
>>
>> Roman
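Following up below the quoted thread: on point 8 above, the back-of-envelope arithmetic is 0.8 * ~1.0 + 0.2 * ~0.1 ≈ 0.82, which matches the ~80% observed on the full set. Below is a rough, hedged sketch of what the quoted advice could look like in code: exporting a single tree from the fitted forest's estimators_ list with sklearn.tree.export_graphviz (one possible answer to the visualisation question, not something from this thread), reporting precision / recall / F1 on a held-out split instead of TPR alone, and running a small grid search over regularising parameters along the lines of the Stack Overflow answer Roman links. The data, file names and grid values are illustrative assumptions only.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import export_graphviz

# placeholder data standing in for the real 600 K record dataset
X = np.random.rand(2000, 20)
y = np.random.randint(0, 2, size=2000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X_train, y_train)

# (a) visualising trees: each fitted tree lives in clf.estimators_; export
# one to a .dot file and render it with graphviz (dot -Tpng tree0.dot -o tree0.png)
export_graphviz(clf.estimators_[0], out_file='tree0.dot',
                feature_names=['f%d' % i for i in range(X.shape[1])])

# (b) evaluate on the held-out split only, with precision / recall / F1
print(classification_report(y_test, clf.predict(X_test)))

# (c) regularise the trees instead of relying on the defaults, which grow
# them until the leaves are (nearly) pure and encourage overfitting
param_grid = {'max_depth': [5, 10, None],
              'min_samples_leaf': [1, 5, 20]}
search = GridSearchCV(RandomForestClassifier(n_estimators=100, n_jobs=-1),
                      param_grid, scoring='f1', cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)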
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn