Paul,
On Thu, Jul 28, 2011 at 6:10 PM, <[email protected]> wrote:
>
> When training a solubility model (see
> http://code.google.com/p/rdkit/wiki/TrainAThreeClassSolubilityModel),
> I run into the problem that three different confusion matrices are
> output.
>
> I wonder what is the origin of these confusion matrices. Even though x- and
> y-axis might be mixed up, the diagonal entries should always be the same.
> Thus, confusion matrices make me confused...
Nicely put. :-)
Here's the story:
The first confusion matrix, the one that results from calling:
ScreenComposite.ShowVoteResults(range(len(pts)), pts, cmp, 3, 0,
                                errorEstimate=True)
contains the out-of-bag predictions of the composite model for your points.
The next one, which comes from this bit of code:
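import numpy

t = BuildSigTree(pts, nPossibleRes=3, maxDepth=3)
# simple results report: rows are the known class, columns the predicted class
confusionMat = numpy.zeros((3, 3), int)  # numpy.int is gone from recent NumPy
for pt in pts:
    # pt[-1] is used as the known class of the point
    confusionMat[pt[-1]][t.ClassifyExample(pt)] += 1
print(confusionMat)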
is actually the confusion matrix for a single decision tree. It has
essentially no connection to the first one at all.
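
On the point about mixed-up axes: swapping the axes only transposes the matrix, so the diagonal stays put. Matrices with different diagonals therefore have to come from different sets of predictions. Here is a minimal sketch of that in plain Python; the labels are made up for illustration and the helper names (confusion_matrix, transpose, diagonal) are mine, not part of the RDKit.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count predictions: rows index the true class, columns the predicted class."""
    mat = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(true_labels, predicted_labels):
        mat[actual][predicted] += 1
    return mat


def transpose(mat):
    """Swap the axes, i.e. rows become the predicted class."""
    return [list(col) for col in zip(*mat)]


def diagonal(mat):
    """Return the counts of correctly classified points per class."""
    return [mat[i][i] for i in range(len(mat))]


# Made-up labels for six points and three classes (illustration only).
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]

mat = confusion_matrix(true_labels, predicted_labels, 3)
print(mat)
print(diagonal(mat), diagonal(transpose(mat)))
```

The two diagonals printed on the last line are identical, whichever axis holds the true class.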
The last one I can't help with because you don't show the code where
you assign the "solu_class" property to the molecules that go in the
SD file. If I had to guess, and it's just a guess, I would bet that
you calculated it by having the composite generate a prediction for
each point in your training set without using the out-of-bag error
estimate. That produces a better-looking confusion matrix, because the
model is tested on the same points it was trained on, but it's not a
whole lot better since there's not a huge amount of overfitting going
on.
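
To make the difference between the two kinds of prediction concrete, here is a rough sketch of a bagged ensemble scored both ways. It is not the RDKit composite code: the data are made up, the ensemble members are simple nearest-neighbour lookups standing in for decision trees, and every name in it is hypothetical. For each point, the out-of-bag vote only counts members whose bootstrap sample did not contain that point, while the training-set vote counts all members.

```python
import random
from collections import Counter


def nearest_neighbour_label(train, x):
    """Predict the label of the training point whose x value is closest."""
    return min(train, key=lambda pt: abs(pt[0] - x))[1]


def majority(votes):
    """Return the most common vote."""
    return Counter(votes).most_common(1)[0][0]


def trace(mat):
    """Number of correctly classified points (sum of the diagonal)."""
    return sum(mat[i][i] for i in range(len(mat)))


rng = random.Random(0)

# Made-up data: (x, class) pairs with three classes and some noisy labels.
points = []
for x in range(30):
    label = x // 10
    if rng.random() < 0.2:
        label = rng.randrange(3)
    points.append((x, label))

n = len(points)
n_models = 25
# Each ensemble member is represented by its bootstrap sample of point indices.
bags = [[rng.randrange(n) for _ in range(n)] for _ in range(n_models)]

train_mat = [[0] * 3 for _ in range(3)]  # every member votes
oob_mat = [[0] * 3 for _ in range(3)]    # only members that never saw the point vote
for i, (x, actual) in enumerate(points):
    all_votes, oob_votes = [], []
    for bag in bags:
        vote = nearest_neighbour_label([points[j] for j in bag], x)
        all_votes.append(vote)
        if i not in bag:
            oob_votes.append(vote)
    train_mat[actual][majority(all_votes)] += 1
    if oob_votes:  # skip points that ended up in every bag
        oob_mat[actual][majority(oob_votes)] += 1

print("training-set predictions:", train_mat, "correct:", trace(train_mat))
print("out-of-bag predictions:  ", oob_mat, "correct:", trace(oob_mat))
```

Nearest-neighbour members memorize their training points, so the gap between the two matrices in this toy setup is likely to be wider than what you see with the composite.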
-greg
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss