> > Dear Greg,
> >
> > when training a solubility model (see
> > http://code.google.com/p/rdkit/wiki/TrainAThreeClassSolubilityModel)
> >
> > I run into the problem that three different confusion matrices are
> > outputted.
> >
> > I wonder what is the origin of these confusion matrices. Even though x- and
> > y-axis might be mixed up, the diagonal entries should always be the same.
> > Thus, confusion matrices make me confused...
>
> nicely put. :-)
>
> Here's the story:
> The first confusion matrix, the one that results from calling:
> ScreenComposite.ShowVoteResults(range(len(pts)), pts, cmp, 3,
> 0,errorEstimate=True)
> contains the out-of-bag predictions of the composite model for your points.
>
> The next one, which comes from this bit of code:
> t = BuildSigTree(pts,nPossibleRes=3,maxDepth=3)
>
> # simple results report:
> confusionMat=numpy.zeros((3,3),numpy.int)
> for pt in pts:
>     confusionMat[pt[-1]][t.ClassifyExample(pt)]+=1
> print confusionMat
>
> Is actually the confusion matrix for a single decision tree. It has
> essentially no connection to the first one at all.
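A small illustration of the point above may help. This is only a sketch with made-up labels and predictions, not the solubility data, and the helper name build_confusion_matrix is hypothetical. It assumes, as the quoted snippet appears to, that the first index is the true class and the second is the predicted class. Two different sets of predictions for the same points share their row sums (the number of points per true class), but there is no reason for their diagonals to agree:

```python
# Illustrative sketch only: labels and predictions are invented for this example.

def build_confusion_matrix(true_classes, predicted_classes, n_classes):
    """Tally (true class, predicted class) pairs; rows = true, columns = predicted."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_cls, pred_cls in zip(true_classes, predicted_classes):
        matrix[true_cls][pred_cls] += 1
    return matrix

# Ten hypothetical points with known classes 0, 1 or 2
true_classes = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

# Two hypothetical models (e.g. an ensemble and a single tree) predicting the same points
predictions_a = [0, 0, 1, 1, 1, 2, 2, 2, 2, 1]
predictions_b = [0, 1, 1, 1, 1, 1, 2, 2, 0, 1]

matrix_a = build_confusion_matrix(true_classes, predictions_a, 3)
matrix_b = build_confusion_matrix(true_classes, predictions_b, 3)

# Row sums depend only on the true classes, so they match between the two matrices
row_sums_a = [sum(row) for row in matrix_a]
row_sums_b = [sum(row) for row in matrix_b]

# Diagonals count correct predictions per class, so they depend on the model
diagonal_a = [matrix_a[i][i] for i in range(3)]
diagonal_b = [matrix_b[i][i] for i in range(3)]

print("row sums:", row_sums_a, row_sums_b)
print("diagonals:", diagonal_a, diagonal_b)
```

So when comparing matrices from different prediction sources, the row sums are the quantity to check for consistency, not the diagonal.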
Thanks for your explanation (I have added it to the Wiki). Now I am a little less confused, but some confusion remains. I have added the code I use to generate "my own confusion matrix" to the Wiki. As I understand it, my function uses the out-of-bag predictions, but I guess that I have overlooked some nasty detail.

Cheers & Thanks,
Paul

P.S.: When comparing the results with a PipelinePilot-based Bayesian categorization model (ECFP_4 & standard settings), I'm surprised to see that the PipelinePilot model is significantly better. I thought that the MorganFingerprints are comparable to the ECFPs and would have assumed that the model quality is in a similar range.

> The last one I can't help with because you don't show the code where
> you assign the "solu_class" property to the molecules that go in the
> SD file. If I had to guess, and it's just a guess, I would bet that
> you calculated it by having the composite generate a prediction for
> each point in your training set, but not using the out-of-bag error
> estimate. This generates a better confusion matrix since it's testing
> using the training set, but it's not a whole lot better since there's
> not a huge amount of overfitting going on.

