Hi,

On May/07/2010, Thomas Dunham wrote:
> Thanks Carles, will try to force some of this into my head this
> weekend....

good!

For me one of the keys is in:

http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Just above "Using the Bayesian result".

I read the formulas like this:
-The probability of the document being spam is proportional to the
product, over each individual word of the document, of the probability
of that word appearing in spam (times the prior probability of spam)
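
To make that reading concrete, here is a minimal sketch in Python. The
word probabilities and the 0.5 prior are numbers I invented for
illustration, and the names (p_word_given_spam, spam_posterior) are
hypothetical; nothing here is taken from Reverend or from the Wikipedia
article beyond the general formula.

```python
from math import prod

# Hypothetical per-word likelihoods, invented for illustration only.
p_word_given_spam = {"cheap": 0.30, "pills": 0.20, "meeting": 0.01}
p_word_given_ham = {"cheap": 0.02, "pills": 0.01, "meeting": 0.20}
p_spam = 0.5  # assumed prior: half of all mail is spam
p_ham = 1.0 - p_spam


def spam_posterior(words):
    """Naive Bayes: prior times the product of per-word likelihoods,
    normalised over the two classes (spam and ham)."""
    spam_score = p_spam * prod(p_word_given_spam[w] for w in words)
    ham_score = p_ham * prod(p_word_given_ham[w] for w in words)
    return spam_score / (spam_score + ham_score)


print(spam_posterior(["cheap", "pills"]))  # close to 1: looks like spam
print(spam_posterior(["meeting"]))         # close to 0: looks like ham
```

The "naive" part is the product itself: it treats the words as
independent of each other given the class, which is rarely true but
works well enough in practice.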

Also interesting here:
http://en.wikipedia.org/wiki/Bayesian_spam_filtering
Where it talks about "Combining individual probabilities"
(it discusses the assumptions and links to the previous Wikipedia
article)
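
The combining formula from that section, p = p1*...*pn / (p1*...*pn +
(1-p1)*...*(1-pn)), can be sketched as below. The function name combine
is my own, and I am not claiming that this is exactly what Reverend
does internally.

```python
from math import prod


def combine(probabilities):
    """Combine per-word spam probabilities p1..pn into a single value:
    p = p1*...*pn / (p1*...*pn + (1-p1)*...*(1-pn)).
    """
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    # Note: this divides by zero if the list contains both 0.0 and 1.0,
    # so real filters usually clamp each value away from 0 and 1 first.
    return spam / (spam + ham)


print(combine([0.9, 0.9]))  # two "spammy" words reinforce each other
print(combine([0.9, 0.1]))  # one spammy and one hammy word cancel out
```

A word with probability 0.5 changes nothing in the result, which is why
filters tend to ignore words that are close to 0.5.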

The other key is in the file reverend/thomas.py, in buildCache, where it
computes the probability of each token belonging to each group. The
thing is that it is doing some "magic" with the metrics that, at the
moment, I'm not following very well (what it does and why it is needed).

So, at a very high level, it does:
Training:
-Tokenize the input
-Save how many times each word appears in the corpus

buildCache (so, part of guessing if no more training is done):
-Calculates, per token, how likely it is to be in each category (and
something else that I'm not following with the goodMetric and
badMetric)

guesser:
-Tokenize the new input
-Combines the probabilities of each token of the input, using the cache
to know how likely this token is to belong to each category.
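
The three steps above can be sketched as a toy class. This is NOT
Reverend's code: ToyGuesser, tokenize and build_cache are hypothetical
names, the per-token probability is just the share of that token's
occurrences that fall in the category (my simplification, standing in
for the goodMetric/badMetric part that I don't follow yet), and the
0.01/0.99 clamping is an assumption to avoid dividing by zero.

```python
import re
from collections import Counter, defaultdict
from math import prod


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class ToyGuesser:
    def __init__(self):
        self.pools = defaultdict(Counter)  # category -> token counts
        self.cache = None                  # category -> token -> probability

    def train(self, category, text):
        """Training: tokenize the input and count each token."""
        self.pools[category].update(tokenize(text))
        self.cache = None  # new training invalidates the cache

    def build_cache(self):
        """Per token, the share of its occurrences seen in each category."""
        corpus = Counter()
        for counts in self.pools.values():
            corpus.update(counts)
        self.cache = {
            category: {tok: counts[tok] / corpus[tok] for tok in counts}
            for category, counts in self.pools.items()
        }

    def guess(self, text):
        """Combine the cached probabilities of the tokens in the input."""
        if self.cache is None:
            self.build_cache()
        scores = {}
        for category, probs in self.cache.items():
            known = [probs[t] for t in tokenize(text) if t in probs]
            if not known:
                continue  # no evidence at all for this category
            # Clamp away from 0 and 1 so the division below is safe.
            known = [min(0.99, max(0.01, p)) for p in known]
            yes = prod(known)
            no = prod(1.0 - p for p in known)
            scores[category] = yes / (yes + no)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


guesser = ToyGuesser()
guesser.train("spam", "buy cheap pills now")
guesser.train("ham", "meeting now at noon")
result = guesser.guess("cheap pills now")
print(result)
```

With this training data "cheap" and "pills" only ever appear in spam, and
"now" appears equally in both, so "spam" comes out on top and "ham" stays
at 0.5 (its only known token is the neutral "now").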

I think that this is a very high-level design, surely with some
mistakes.

If someone can calculate one example by hand and get the same result as
Reverend, they would get some extra points :-D I'm only quite confused
about some things in buildCache...

-- 
Carles Pina i Estany
        http://pinux.info
_______________________________________________
python-uk mailing list
python-uk@python.org
http://mail.python.org/mailman/listinfo/python-uk
