Your model is overfit to the training data. That's not to say a better fit is necessarily achievable, but the default settings for trees lean towards a tight fit, so you might adjust their parameters to increase regularisation. In any case, you should not expect a model's performance on its training data to be indicative of its general performance; this is why we use held-out test sets and cross-validation.
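Below is a minimal sketch of what I mean, assuming a RandomForestClassifier like yours; the data here is synthetic and the parameter values are placeholders to adapt to your problem, not recommendations:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for your 600 K records; use your own X, y.
    X, y = make_classification(n_samples=10000, weights=[0.9, 0.1],
                               random_state=0)

    # Shallower trees and bigger leaves increase regularisation; by
    # default each tree is grown until its leaves are (nearly) pure,
    # which is what lets the forest memorise the training set.
    clf = RandomForestClassifier(n_estimators=200, max_depth=10,
                                 min_samples_leaf=20,
                                 class_weight='balanced')

    # Cross-validated F1 reflects precision and recall together, and
    # is estimated on folds the model was not fit on.
    scores = cross_val_score(clf, X, y, cv=5, scoring='f1')
    print('F1: %0.3f +/- %0.3f' % (scores.mean(), scores.std()))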
On 27 December 2016 at 20:51, Roman Yurchak <rth.yurc...@gmail.com> wrote:
> Hi Debu,
>
> On 27/12/16 08:18, Andrew Howe wrote:
> > 5. I got a prediction result with True Positive Rate (TPR) as 10-12
> >    % on probability thresholds above 0.5
>
> Getting a high True Positive Rate (recall) is not a sufficient condition
> for a well-behaved model, though 0.1 recall is still pretty bad. You
> could look at the precision at the same time (or consider, for instance,
> the F1 score).
>
> > 7. I reloaded the model in a different python instance from the
> >    pickle file mentioned above and did my scoring, i.e., used the
> >    joblib library load method and then instantiated prediction
> >    (predict_proba method) on the entire set of my original 600 K
> >    records
> >    Another question – is there an alternate model scoring
> >    library (apart from joblib, the one I am using)?
>
> Joblib is not a scoring library; once you load a model from disk with
> joblib you should get ~ the same RandomForestClassifier estimator object
> as before saving it.
>
> > 8. Now when I am running (scoring) my model using
> >    joblib.predict_proba on the entire set of original data (600 K),
> >    I am getting a True Positive rate of around 80%.
>
> That sounds normal, considering what you are doing. Your entire set
> consists of 80% training set (for which the recall, I imagine, would
> be close to 1.0) and 20% test set (with a recall of 0.1), so on
> average you would get a recall close to 0.8 for the complete set. Unless
> I missed something.
>
> > 9. I did some further analysis and figured out that during the
> >    training process, when the model was predicting on the test
> >    sample of 120K it could only predict 10-12% of 120K data beyond
> >    a probability threshold of 0.5. When I am now trying to score my
> >    model on the entire set of 600 K records, it appears that the
> >    model is remembering some of its past behavior and data and
> >    accordingly throwing an 80% True Positive rate
>
> It feels like your RandomForestClassifier is not properly tuned. A
> recall of 0.1 on the test set is quite low. It could be worth trying to
> tune it better (cf. https://stackoverflow.com/a/36109706), using some
> other metric than recall to evaluate the performance.
>
> Roman
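To put numbers on Roman's point about the 80% figure: if the 600 K records you score are 80% training data (recall close to 1.0) and 20% test data (recall of 0.1), scoring them together gives roughly 0.8 * 1.0 + 0.2 * 0.1 = 0.82, which tells you nothing new about generalisation. Continuing from the sketch above, here is what the save/load round trip looks like when you evaluate on the held-out split only (the file name is a placeholder):

    from sklearn.externals import joblib  # plain `import joblib` also works
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    clf.fit(X_train, y_train)
    joblib.dump(clf, 'model.pkl')

    # Loading restores essentially the same fitted estimator; joblib
    # only (de)serialises, it does not do any scoring itself.
    loaded = joblib.load('model.pkl')

    # Evaluate on the held-out 20% only, never on the full 600 K.
    print(classification_report(y_test, loaded.predict(X_test)))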
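And for the tuning Roman suggests (cf. the Stack Overflow answer he links), a minimal grid search, continuing from the snippets above; the grid values are illustrative, not recommendations:

    from sklearn.model_selection import GridSearchCV

    param_grid = {'n_estimators': [100, 300],
                  'max_depth': [5, 10, None],
                  'min_samples_leaf': [1, 10, 50]}

    # Pick the combination with the best cross-validated F1 on the
    # training split, then confirm on the untouched test split.
    search = GridSearchCV(RandomForestClassifier(class_weight='balanced'),
                          param_grid, scoring='f1', cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)
    print(classification_report(y_test, search.predict(X_test)))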