Hi Debu,

On 27/12/16 08:18, Andrew Howe wrote:
> 5. I got a prediction result with True Positive Rate (TPR) as 10-12%
>    on probability thresholds above 0.5

A high True Positive Rate (recall) is not, on its own, a sufficient condition for a well-behaved model, though a recall of 0.1 is still pretty bad. You should look at the precision at the same time (or consider, for instance, the F1 score, which combines the two).

> 7. I reloaded the model in a different python instance from the
>    pickle file mentioned above and did my scoring , i.e., used
>    joblib library load method and then instantiated prediction
>    (predict_proba method) on the entire set of my original 600 K
>    records
>
> Another question – is there an alternate model scoring
> library (apart from joblib, the one I am using) ?

Joblib is not a scoring library; it only handles saving and loading. Once you load a model from disk with joblib, you should get essentially the same RandomForestClassifier estimator object as before saving it.

> 8. Now when I am running (scoring) my model using
>    joblib.predict_proba on the entire set of original data (600 K),
>    I am getting a True Positive rate of around 80%.

That sounds normal, considering what you are doing. Your entire set consists of 80% training set (for which the recall, I imagine, would be close to 1.0) and 20% test set (with a recall of 0.1), so on average you would get a recall close to 0.8 for the complete set. Unless I missed something.

> 9. I did some further analysis and figured out that during the
>    training process, when the model was predicting on the test
>    sample of 120K it could only predict 10-12% of 120K data beyond
>    a probability threshold of 0.5. When I am now trying to score my
>    model on the entire set of 600 K records, it appears that the
>    model is remembering some of it’s past behavior and data and
>    accordingly throwing 80% True positive rate

It feels like your RandomForestClassifier is not properly tuned. A recall of 0.1 on the test set is quite low. It could be worth trying to tune it better (cf. https://stackoverflow.com/a/36109706), using some metric other than the recall to evaluate the performance.
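To make the arithmetic behind point 8 concrete, here is a minimal sketch in plain Python. The helper name `pooled_recall` and the positive counts are hypothetical (I do not know how many positives your data contains), and it assumes the positives are split 80/20 between the training and test sets in the same way as the records themselves.

```python
def pooled_recall(splits):
    """Recall over the union of several data splits.

    `splits` is a list of (n_positives, recall) pairs, one per split.
    Pooled recall = total true positives / total positives.
    """
    total_pos = sum(n_pos for n_pos, _ in splits)
    if total_pos == 0:
        raise ValueError("no positive samples in any split")
    true_pos = sum(n_pos * recall for n_pos, recall in splits)
    return true_pos / total_pos


# Hypothetical counts: 10,000 positives overall, split 80/20.
train = (8000, 1.0)  # training set: recall assumed close to 1.0
test = (2000, 0.1)   # held-out test set: recall of about 0.1, as reported

full_set_recall = pooled_recall([train, test])
print(round(full_set_recall, 2))  # 0.82
```

So a recall of around 0.8 on the full 600 K records mostly reflects the rows the model was trained on. Only the figure from the held-out 120 K says anything about how the model generalises.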
Roman
