Dear JP,

Firstly, sorry for the delayed reply.

You appear to be on to something. I played around with your code and the
classifier and after discussing it with @Gael, it seems clear that there's
a pig in the truffle-patch here.

It seems that, in the case of your sample code, the problem occurs when no
features are present in one of the classes. It also occurs when you take,
for example, X_2 = X + 2 and use that as the data.

It might have something to do with the workings of the
_count(X, Y) method.
Upon checking `clf.feature_log_prob_`, I get
array([[-1.60943791, -1.60943791, -1.60943791, -1.60943791, -1.60943791],
       [-1.60943791, -1.60943791, -1.60943791, -1.60943791, -1.60943791]])
which doesn't look good.
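
Two identical rows would be enough to explain the label-dependent answers in the original report. The sketch below is not the library's code: it is a plain-Python illustration that assumes equal class priors (one training row per class), class labels kept in sorted order, and ties resolved in favour of the first class. The helper names `joint_log_likelihood` and `predict` are made up for this example.

```python
import math

# Value reported for every entry of feature_log_prob_, in both classes.
reported = -1.60943791
p = math.exp(reported)  # very close to 0.2, i.e. 1/5

def joint_log_likelihood(x, log_prior, feature_log_prob):
    """Bernoulli NB score: a set feature adds log p, an unset one adds log(1 - p)."""
    score = log_prior
    for x_i, lp in zip(x, feature_log_prob):
        score += lp if x_i else math.log(1.0 - math.exp(lp))
    return score

def predict(x, labels):
    # Assumptions (not checked against the library): equal priors, labels in
    # sorted order, and a tie goes to the first label in that order.
    classes = sorted(set(labels))
    log_prior = math.log(1.0 / len(classes))
    rows = {c: [reported] * len(x) for c in classes}  # identical rows, as observed
    scores = [joint_log_likelihood(x, log_prior, rows[c]) for c in classes]
    return classes[scores.index(max(scores))]

print(predict([0, 0, 0, 0, 0], [1, 2]))  # prints 1
print(predict([0, 0, 0, 0, 0], [3, 2]))  # prints 2
```

Under those assumptions every class gets the same score, so the prediction is simply the smallest label: "1" for Y = [1, 2] and "2" for Y = [3, 2], which matches what JP saw.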

I looked around here for a while, but I don't really have enough knowledge
about the Naive Bayes classifier for multivariate Bernoulli models to know
exactly what the problem is, or whether it's even a special case that is
normal in the theory of the model. So I would really appreciate it if
anyone from the community could lend their expertise.
@Lars, if you are around and not too busy, I'd really appreciate your
thoughts on this.

If nobody who's already familiar with this can help, I'll read up a bit and
see if I can solve this.
If this is just normal behavior for this type of model, we should add an
error message, but it would be nice if it can be fixed.

Thanks, @JP, for bringing this to our attention.
Kind Regards,
Jaques




I do not know the theory behind the Naive Bayes classifier for multivariate
Bernoulli models very well, but this is the relevant line in _count(X, Y):

    self.feature_log_prob_ = (np.log(N_c_i + self.alpha)
                              - np.log(N_c.reshape(-1, 1)
                                       + self.alpha * X.shape[1]))
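
As a rough check, the quoted expression can be evaluated by hand on JP's data. The sketch below uses plain Python rather than NumPy and rests on two assumptions that are not confirmed anywhere in this thread: that `alpha` is 1.0, and that `N_c` is the sum of `N_c_i` over all features for a class. The function names are invented for the example. For comparison it also computes a per-sample estimate, in which the denominator is the number of rows in the class plus 2 * alpha, the form usually given for the multivariate Bernoulli model.

```python
import math

X = [[1, 1, 1, 1, 1],
     [0, 0, 0, 0, 0]]
Y = [1, 2]
alpha = 1.0              # assumed smoothing value
n_features = len(X[0])   # plays the role of X.shape[1]

def feature_counts(label):
    """N_c_i: how often each feature is set among the rows of one class."""
    rows = [x for x, y in zip(X, Y) if y == label]
    return [sum(col) for col in zip(*rows)]

def quoted_formula(label):
    # Assumption: N_c is the total of N_c_i over all features of the class.
    n_ci = feature_counts(label)
    n_c = sum(n_ci)
    return [math.log(c + alpha) - math.log(n_c + alpha * n_features)
            for c in n_ci]

def per_sample_estimate(label):
    # Denominator counts rows in the class; 2 * alpha covers the two
    # outcomes (feature set / not set).
    n_ci = feature_counts(label)
    n_rows = sum(1 for y in Y if y == label)
    return [math.log((c + alpha) / (n_rows + 2 * alpha)) for c in n_ci]

for label in (1, 2):
    print(label, [round(v, 8) for v in quoted_formula(label)])
    print(label, [round(v, 8) for v in per_sample_estimate(label)])
```

With these assumptions the quoted expression gives log(2/10) for class 1 and log(1/5) for class 2, so both rows come out as -1.60943791, the same numbers as the array reported above. The per-sample estimate gives log(2/3) and log(1/3) instead, which does separate the two classes.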

Contact Lars.
Maybe it's a property of the model, so perhaps include an error, or we can
fix it.


2012/8/4 JP <[email protected]>

> Hi there at SCIKIT
>
> First time user/poster here, but I'd like to thank you for this useful
> piece of software.
>
> Using scikit-learn 0.10
>
> Why does the following (pastebin: http://pastebin.com/Hufs6aZJ):
>
> import numpy as np
>
> import sklearn
> from sklearn.naive_bayes import *
>
> print sklearn.__version__
>
> X = np.array([ [1, 1, 1, 1, 1],
>                [0, 0, 0, 0, 0] ])
> print "X: ", X
> Y = np.array([ 1, 2 ])
> print "Y: ", Y
>
> clf = BernoulliNB()
> clf.fit(X, Y)
> print "Prediction:", clf.predict( [0, 0, 0, 0, 0] )
>
>
> Print out an answer of "1" ?  Having trained the model on [0,0,0,0,0] => 2
> I was expecting "2" as the answer.
> And why does replacing Y with
>
> Y = np.array([ 3, 2 ])
>
> Give a different class "2" as an answer (the correct one) ?  Isn't this
> just a class label?
>
> Can someone shed some light on this?
>
> Many Thanks
> JP
>
> PS Thanks to the IRC people who tried to help (NelleV)
>
>
> -
> Jean-Paul Ebejer
> Early Stage Researcher
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>