Hi Debu,

"Should I be using 2 different input datasets (completely exclusive / disjoint) for training and scoring the models?"

Yes - this is the reason for partitioning the data into training / testing sets. However, I can't imagine that it's the cause of your odd results. What is the total classification result in both training & testing (not just TPs)?
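For a concrete picture of what I mean, here is a minimal sketch on synthetic data that prints the full per-split result rather than a single TPR number. The toy dataset, class balance, 0.5 threshold and names like clf are illustrative assumptions, not your actual setup:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data and model; substitute your own 600 K records and fitted forest.
X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

for name, X_part, y_part in [("training", X_train, y_train), ("testing", X_test, y_test)]:
    # Threshold predict_proba at 0.5, as in your workflow, then report everything:
    # the confusion matrix gives the TN/FP/FN/TP counts, not just the true positives.
    y_pred = (clf.predict_proba(X_part)[:, 1] >= 0.5).astype(int)
    print(name)
    print(confusion_matrix(y_part, y_pred))
    print(classification_report(y_part, y_pred))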
Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
www.andrewhowe.com
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>

On Tue, Dec 27, 2016 at 8:26 AM, Debabrata Ghosh <[email protected]> wrote:

> Hi Joel,
>
> Thanks for your quick feedback – I certainly understand what you mean.
> Please allow me to explain one more time through the sequence of steps
> corresponding to the approach I followed:
>
> 1. I considered a dataset containing 600 K (0.6 million) records for
> training my model using scikit-learn's Random Forest Classifier.
>
> 2. I did a training and test split on the 600 K, forming a 480 K training
> dataset and a 120 K test dataset (80:20 split).
>
> 3. I trained scikit-learn's Random Forest Classifier on the 480 K (80%)
> training sample.
>
> 4. I then ran prediction (the predict_proba method) on the 120 K test
> sample.
>
> 5. I got a prediction result with a True Positive Rate (TPR) of 10-12% at
> probability thresholds above 0.5.
>
> 6. I saved the above Random Forest Classifier model using scikit-learn's
> joblib library (dump method) in the form of a pickle file.
>
> 7. I reloaded the model in a different Python instance from the pickle
> file mentioned above and did my scoring, i.e., used the joblib load method
> and then ran prediction (predict_proba) on the entire set of my original
> 600 K records.
>
> 8. Now, when I run (score) the model using predict_proba on the entire set
> of original data (600 K), I am getting a True Positive Rate of around 80%.
>
> 9. I did some further analysis and figured out that during the training
> process, when the model predicted on the 120 K test sample, it could only
> predict 10-12% of that data beyond a probability threshold of 0.5. When I
> now score the model on the entire set of 600 K records, it appears that the
> model is remembering some of its past behavior and data, and accordingly
> returns an 80% True Positive Rate.
>
> 10. When I scored the model using predict_proba on a dataset completely
> disjoint from the one used for training (i.e., no overlap between training
> and scoring data), it gave me the expected True Positive Rate (in the range
> of 10 – 12%).
>
> *Here lies my question once again:* Should I be using 2 different input
> datasets (completely exclusive / disjoint) for training and scoring the
> models? If the input datasets for scoring and training overlap, then I get
> incorrect results. Will that be a fair assumption?
>
> Another question – is there an alternate model scoring library (apart from
> joblib, the one I am using)?
>
> Thanks once again for your feedback in advance!
>
> Cheers,
>
> Debu
>
> On Tue, Dec 27, 2016 at 1:56 AM, Joel Nothman <[email protected]>
> wrote:
>
>> Hi Debu,
>>
>> Your post is terminologically confusing, so I'm not sure I've understood
>> your problem. Where is the "different sample" used for scoring coming
>> from? Is it possible it is more related to the training data than the
>> test sample?
>>
>> Joel
>>
>> On 27 December 2016 at 05:28, Debabrata Ghosh <[email protected]>
>> wrote:
>>
>>> Dear All,
>>>
>>> Greetings!
>>>
>>> I need some urgent guidance and help from you all in model scoring. What
>>> I mean by model scoring is around the following steps:
>>>
>>> 1. I have trained a Random Forest Classifier model using scikit-learn
>>> (the RandomForestClassifier library).
>>> 2. Then I have generated the True Positive and False Positive
>>> predictions on my test data set using the predict_proba method (I have
>>> split my data into training and test samples in an 80:20 ratio).
>>> 3. Finally, I have dumped the model into a pkl file.
>>> 4. Next, in another instance, I have loaded the .pkl file.
>>> 5. I have called the predict_proba method of the joblib-loaded model to
>>> predict the True Positives and False Positives on a different sample. I
>>> am terming this step as scoring, where I am predicting without retraining
>>> the model.
>>>
>>> My question is: when I generate the True Positive Rate on the test data
>>> set (as part of the model training approach), the rate I am getting is
>>> 10 – 12%. But when I do the scoring (using the steps mentioned above), my
>>> True Positive Rate shoots up to 80%. Although I am happy to get a very
>>> high TPR, my question is whether getting such a high TPR during the
>>> scoring phase is an expected outcome. In other words, whether achieving a
>>> high TPR through joblib is an acceptable outcome vis-à-vis getting the
>>> TPR on the training / test data set.
>>>
>>> Your views on the above will be really helpful, as I am very confused
>>> whether to consider scoring the model using joblib. Otherwise, is there
>>> any other alternative to joblib which can help me do the scoring without
>>> retraining the model? Please let me know at your earliest convenience, as
>>> I am a bit pressed for time.
>>>
>>> Thanks for your help in advance!
>>>
>>> Cheers,
>>>
>>> Debu
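Below is a minimal, self-contained sketch (toy data, an illustrative file name, and a hypothetical helper tpr_at_threshold, not your actual code) of why overlap between training and scoring data inflates the TPR: a random forest fits its training rows almost perfectly, so a reloaded model scored on the full 600 K will look far better than it does on held-out or disjoint data.

import joblib  # older scikit-learn versions expose this as sklearn.externals.joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def tpr_at_threshold(model, X, y, threshold=0.5):
    # True Positive Rate = correctly flagged positives / all actual positives
    predicted_pos = model.predict_proba(X)[:, 1] >= threshold
    return (predicted_pos & (y == 1)).sum() / (y == 1).sum()


# Toy stand-in for the 600 K records: imbalanced binary data, 80:20 split.
X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
joblib.dump(clf, "rf_model.pkl")         # persist the fitted model (step 6)

reloaded = joblib.load("rf_model.pkl")   # reload it elsewhere without retraining (step 7)
print("TPR on held-out test data:", tpr_at_threshold(reloaded, X_test, y_test))
print("TPR on the full dataset  :", tpr_at_threshold(reloaded, X, y))
# The second number is much higher only because the full dataset contains the
# training rows, which the forest has effectively memorised - not because the
# reloaded model is any better.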
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
